1. Introduction

a. Problem Statement

Is there a significant difference in income between men and women? Does the difference vary depending on other factors discussed below?

To address this problem, we will use the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set.

b. Changing column names

We renamed the 19 variables that we believe may affect the income gap between men and women.

c. Creating a subset for selected columns

Below is the list of shortlisted variables and the reasons behind their selection:

  1. total.incarcerations: Imprisonment gives information on criminal history that can affect job and income levels
  2. marijuana: Drug use, such as marijuana use, may affect day-to-day performance and income
  3. sex: We want to use this in our hypothesis to assess income gap between male and female
  4. marital.status: Consider whether marital status has any effect on income gap between male and female. For example, a divorced individual may have to work twice as hard to make ends meet possibly leading to a higher income.
  5. high.school.diploma: High school diploma may determine an individual’s entry in formal or informal job market
  6. parenthood.by.20: Early parenthood can affect job opportunities and income levels due to greater child-rearing responsibilities at home
  7. chance.college.degree: Consider whether having a college degree increases income prospects for male and female
  8. hard.times: Hard times can affect performance at work and hence affect income levels
  9. work.limitation: Physical, mental and emotional conditions that limit one’s performance at work
  10. citizenship: US citizenship may increase one’s chance of landing a higher salary paying job in the US
  11. mother.edu: Individual’s background such as parents’ education may have impact on individual’s income
  12. father.edu: Individual’s background such as parents’ education may have impact on individual’s income
  13. race: Individuals from different races may face different income levels
  14. hard.drugs: Hard drugs can affect an individual’s day to day functioning and affect income
  15. public.private: Attending a public vs. private education institute may affect income
  16. highest.degree: Highest degree achieved may also explain difference in income
  17. emp_industry: Belonging to a particular job industry may affect income
  18. job.type: Belonging to a particular job type may affect income
  19. income: We want to use income to assess income gaps between male and female

d. Understanding the data

After creating the subset, running str shows that all of our selected variables are numeric. The renamed_nlsy1 dataset has 5053 rows and 19 columns, whereas the original data has 8984 rows and 95 columns. The filter keeps only rows with income greater than zero and total.incarcerations greater than or equal to zero, so every row with a non-positive income or a negative total.incarcerations value is dropped, leaving 5053 rows. Negative values in the dataset encode missing data (-1 refusal, -2 don't know, -3 invalid skip, -4 valid skip, -5 non-interview). Since these codes do not record an actual income level or number of incarcerations, we dropped those rows.

At this stage every variable is stored as numeric. The category labels are listed in the variable codebook and are applied later, in the data cleaning section.

renamed_nlsy1 <- nlsy %>%
  filter( income > 0 & total.incarcerations >= 0) %>%
  select(
  "total.incarcerations", 
  "marijuana",
  "sex", 
  "marital.status",
  "high.school.diploma", 
  "parenthood.by.20", 
  "chance.college.degree", 
  "hard.times", 
  "work.limitation", 
  "citizenship", 
  "mother.edu",
  "father.edu",
  "race", 
  "hard.drugs", 
  "public.private", 
  "highest.degree", 
  "emp_industry_2011", 
  "job.type", 
  "income")
## [1] 5053   19
## tibble [5,053 × 19] (S3: tbl_df/tbl/data.frame)
##  $ total.incarcerations : num [1:5053] 0 0 0 0 0 0 0 0 0 0 ...
##  $ marijuana            : num [1:5053] 0 0 1 0 0 0 0 0 0 0 ...
##  $ sex                  : num [1:5053] 1 2 1 1 1 2 1 1 1 1 ...
##  $ marital.status       : num [1:5053] 0 0 1 1 1 1 0 0 0 0 ...
##  $ high.school.diploma  : num [1:5053] -4 100 -4 -4 -4 -4 95 -4 -4 -4 ...
##  $ parenthood.by.20     : num [1:5053] -4 0 -4 -4 -4 -4 0 -4 -4 -4 ...
##  $ chance.college.degree: num [1:5053] -4 100 -4 -4 -4 -4 95 -4 -4 -4 ...
##  $ hard.times           : num [1:5053] -4 0 0 0 0 0 0 -4 0 0 ...
##  $ work.limitation      : num [1:5053] -4 1 0 0 0 0 0 -4 0 0 ...
##  $ citizenship          : num [1:5053] -4 3 3 3 3 1 1 -4 3 1 ...
##  $ mother.edu           : num [1:5053] 15 12 12 12 12 14 12 6 11 15 ...
##  $ father.edu           : num [1:5053] 14 -4 12 6 6 -4 12 -4 -4 -4 ...
##  $ race                 : num [1:5053] 2 2 2 4 4 2 2 2 2 1 ...
##  $ hard.drugs           : num [1:5053] 0 0 0 0 0 0 0 0 0 0 ...
##  $ public.private       : num [1:5053] -4 -4 -4 2 -5 3 -4 -4 -4 -4 ...
##  $ highest.degree       : num [1:5053] 2 2 2 5 -5 2 1 0 3 1 ...
##  $ emp_industry_2011    : num [1:5053] 9470 8180 9470 7860 -5 8190 4470 8680 -4 -4 ...
##  $ job.type             : num [1:5053] 3910 4840 3740 230 1000 5860 4850 4110 4650 3500 ...
##  $ income               : num [1:5053] 116000 45000 125000 59000 75000 36000 63000 35000 20000 55000 ...
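The effect of the filter condition can be illustrated on a small made-up data frame. This is only a sketch in base R: the `toy` data frame and its values are hypothetical, and only the condition itself is taken from the code above.

```r
# Toy stand-in for the raw extract (values are made up for illustration).
toy <- data.frame(
  income               = c(45000, -4, 0, 12000, -5, 30000),
  total.incarcerations = c(0, 0, 1, -3, 2, 3)
)

# Same condition as the filter() call above, written in base R:
# income must be strictly positive, incarcerations must be non-negative.
keep <- toy$income > 0 & toy$total.incarcerations >= 0
toy_kept <- toy[keep, ]

# Number of rows that survive the filter
nrow(toy_kept)
```

Note that a row with an income of exactly 0 is dropped as well, because the condition on income is strict.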

2. Cleaning the data

  1. For better data analysis, we converted every column from numeric to factor, except for total.incarcerations and income. The factor type lets us create the categories listed in the variable codebook. The following steps were implemented:
     a. Define functions that create and reorder the categories listed in the variable codebook
     b. Apply these functions to each variable to reorganize the column
     c. Mutate each variable with as.factor to convert it from numeric to factor.


  2. For factor variables, the negative codes, i.e. refusal (-1), don’t know (-2), invalid skip (-3), valid skip (-4) and non-interview (-5), were all assigned to a single missing category. After reading each code’s description, we recognize that refusals could convey meaningful information for variables such as marijuana use, but we still grouped them as missing because their count is negligible relative to the other categories of that variable. This gave us a count of missing values for each variable. We then decided to drop variables with a large share of missing values. We set the threshold at 40%: if missing values make up more than 40% of a variable, we consider that variable too incomplete to convey meaningful information. By this measure, we dropped high.school.diploma, parenthood.by.20, chance.college.degree, and public.private.
## [1] "data.frame"
Count of Missing Values

variable                missing
total.incarcerations          0
marijuana                    19
sex                           0
marital.status               30
high.school.diploma        3116
parenthood.by.20           3130
chance.college.degree      3119
hard.times                  560
work.limitation             557
citizenship                 532
mother.edu                  349
father.edu                 1725
race                          0
hard.drugs                  235
public.private             4033
highest.degree              308
emp_industry_2011           749
job.type                    125
income                        0
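The 40% rule can be checked directly from the counts in this table. The sketch below is an illustration rather than the original chunk: the vector name `missing_counts` is an assumption, while the counts and the 5053-row total are taken from the report.

```r
# Missing counts copied from the table above; 5053 is the row count
# of renamed_nlsy1 reported earlier.
n_rows <- 5053
missing_counts <- c(
  total.incarcerations = 0, marijuana = 19, sex = 0, marital.status = 30,
  high.school.diploma = 3116, parenthood.by.20 = 3130,
  chance.college.degree = 3119, hard.times = 560, work.limitation = 557,
  citizenship = 532, mother.edu = 349, father.edu = 1725, race = 0,
  hard.drugs = 235, public.private = 4033, highest.degree = 308,
  emp_industry_2011 = 749, job.type = 125, income = 0
)

# Share of missing values per variable and the variables above the threshold
missing_share <- missing_counts / n_rows
to_drop <- names(missing_share)[missing_share > 0.40]

# Percentage missing for the variables that exceed the threshold
round(100 * missing_share[to_drop], 1)
```

With these counts, father.edu is the largest variable that stays below the threshold, at roughly 34% missing.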
## 'data.frame':    5053 obs. of  19 variables:
##  $ total.incarcerations : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ marijuana            : Factor w/ 3 levels "no","yes","missing": 1 1 2 1 1 1 1 1 1 1 ...
##  $ sex                  : Factor w/ 2 levels "Male","Female": 1 2 1 1 1 2 1 1 1 1 ...
##  $ marital.status       : Factor w/ 6 levels "Divorced","Married",..: 4 4 2 2 2 2 4 4 4 4 ...
##  $ high.school.diploma  : Factor w/ 12 levels "0%","1%-10%",..: 12 11 12 12 12 12 11 12 12 12 ...
##  $ parenthood.by.20     : Factor w/ 12 levels "0%","1%-10%",..: 12 1 12 12 12 12 1 12 12 12 ...
##  $ chance.college.degree: Factor w/ 12 levels "0%","1%-10%",..: 12 11 12 12 12 12 11 12 12 12 ...
##  $ hard.times           : Factor w/ 3 levels "missing","no",..: 1 2 2 2 2 2 2 1 2 2 ...
##  $ work.limitation      : Factor w/ 3 levels "missing","no",..: 1 3 2 2 2 2 2 1 2 2 ...
##  $ citizenship          : Factor w/ 4 levels "missing","unknown.birthplace",..: 1 2 2 2 2 4 4 1 2 4 ...
##  $ mother.edu           : Factor w/ 22 levels "missing","1ST GRADE",..: 16 13 13 13 13 15 13 7 12 16 ...
##  $ father.edu           : Factor w/ 22 levels "missing","1ST GRADE",..: 15 1 13 7 7 1 13 1 1 1 ...
##  $ race                 : Factor w/ 4 levels "Black","Hispanic",..: 2 2 2 4 4 2 2 2 2 1 ...
##  $ hard.drugs           : Factor w/ 3 levels "missing","no",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ public.private       : Factor w/ 4 levels "missing","Private for-profit institution",..: 1 1 1 3 1 2 1 1 1 1 ...
##  $ highest.degree       : Factor w/ 9 levels "Associate/Junior college (AA)",..: 4 4 4 5 6 4 3 7 1 3 ...
##  $ emp_industry_2011    : Factor w/ 18 levels "ACS SPECIAL CODES",..: 14 5 14 5 11 5 18 6 11 11 ...
##  $ job.type             : Factor w/ 33 levels "ACS SPECIAL CODES",..: 28 29 28 10 20 24 29 13 25 14 ...
##  $ income               : num  116000 45000 125000 59000 75000 36000 63000 35000 20000 55000 ...
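The recoding described in this section can be sketched for a single yes/no variable. The helper name `recode_yes_no` and the sample values are hypothetical; the level order matches the marijuana factor in the output above, and the sketch assumes that 1 means "yes" and any other non-negative code means "no".

```r
# Hypothetical helper: map numeric codebook values to labelled factor levels.
# All negative codes (-1 to -5) are collapsed into one "missing" level.
recode_yes_no <- function(x) {
  labels <- ifelse(x < 0, "missing", ifelse(x == 1, "yes", "no"))
  factor(labels, levels = c("no", "yes", "missing"))
}

# Made-up raw values, including a refusal (-1) and a valid skip (-4)
marijuana_raw <- c(0, 0, 1, -1, 0, -4)
marijuana <- recode_yes_no(marijuana_raw)

# Count of observations per category
table(marijuana)
```

Keeping "missing" as an explicit level, rather than converting it to NA, is what makes the per-variable missing counts easy to tabulate.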

3. Exploratory analysis and summaries

a. Getting to know our two main variables

Using summary, we can see that the median income is 40,000 while the mean is 49,781. The lower and upper quartiles are 25,000 and 62,000 respectively. Because the mean exceeds the median, the distribution is right-skewed, which is consistent with the presence of top-coded values. The density plots for males and females confirm this, and they also show that the female density is higher than the male density at lower income levels. Using table for sex, we count 2600 males and 2453 females, so the sample is split roughly evenly between the two genders. The histograms of income within each gender show that the distribution is not normal for either males or females. We also see outliers, which we attribute to the top-coded income values. On closer inspection, male incomes cluster around a higher value than female incomes.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2   25000   40000   49781   62000  235884
## 
##                     Black                  Hispanic Mixed Race (Non-Hispanic) 
##                      1242                      1048                        47 
##  Non-Black / Non-Hispanic 
##                      2716

b. Dealing with top-coded values

121 of the incomes are top-coded to the maximum value of $235,884, which is the average income of the top 121 earners. We confirmed this using the slice_max function, which pulled out the top 2% of rows from our renamed_nlsy1 dataset based on income. To understand the top-coded values and their implications for our analysis, we first drew a histogram of income for each gender. We also constructed a QQ-plot to check whether the data is normal. As expected, the distribution is right-skewed because of the outliers, and the data does not appear to be normal. We therefore removed the top-coded values so that they do not distort our analysis. The result is the renamed_nlsy dataset, and its QQ-plot shows a relative improvement in the normality of the income distribution.

## [1] 4933
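One way to remove top-coded rows is to drop every observation equal to the maximum income. The sketch below uses made-up incomes and base R; it assumes that the top-coded rows are exactly those sharing the maximum value, which may differ from how the original chunk selected them.

```r
# Made-up incomes; three of them share the top-coded value.
income <- c(20000, 35000, 62000, 235884, 235884, 48000, 235884)

# Assumption: top-coded rows are the ones equal to the maximum
top_code   <- max(income)
n_topcoded <- sum(income == top_code)

# Keep only incomes strictly below the top-coded value
income_trimmed <- income[income < top_code]

n_topcoded
length(income_trimmed)
```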

c. Testing statistical significance of each variable on income

1. sex

From the box plot, we can observe that the median income is higher for males than for females. To check whether sex has a statistically significant effect on income, we run a t.test and check the p-value. As the p-value is less than 0.05, we reject the null hypothesis of equal mean incomes and conclude that mean income differs by sex. This is in line with our expectation that sex affects income.

## `summarise()` ungrouping output (override with `.groups` argument)
sex num.obs mean.income sd.income se.income
Male 2510 51137 29506 589
Female 2423 39159 26385 536
## 
##  Welch Two Sample t-test
## 
## data:  income by sex
## t = 15.041, df = 4902.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  10416.46 13538.83
## sample estimates:
##   mean in group Male mean in group Female 
##             51136.71             39159.06
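The Welch statistics can be reproduced from the summary table alone, which is a useful sanity check. The sketch below uses the group means, standard deviations and sizes printed above; the variable names are assumptions, and small differences from the printed output come from rounding in the table.

```r
# Group summaries copied from the output above
m <- c(Male = 51136.71, Female = 39159.06)  # group means
s <- c(Male = 29506,    Female = 26385)     # group standard deviations
n <- c(Male = 2510,     Female = 2423)      # group sizes

# Welch t-test: unequal variances, Satterthwaite degrees of freedom
v      <- s^2 / n                           # variance of each group mean
se     <- sqrt(sum(v))                      # standard error of the difference
gap    <- unname(m["Male"] - m["Female"])
t_stat <- gap / se
df     <- sum(v)^2 / sum(v^2 / (n - 1))

# 95% confidence interval for the difference in means
ci <- gap + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 2)
```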
2. marijuana

From the box plot, we can observe that the median income is the same whether or not marijuana was consumed. To check whether the income gap between men and women is statistically significant within marijuana users and non-users, we run a t.test in each group and plot the estimated gap as a bar with confidence-interval error bars. The bar graph shows three things: i) The income gap between men and women is statistically significant for both users and non-users, since neither error bar contains the null hypothesis value of 0. ii) Both bars are positive, which is evidence that men earn more than women in both groups. iii) Since the error bar for users contains the height of the bar for non-users, the income gap for users is not statistically different from the gap for non-users. Finally, the missing category has a wide confidence interval because it contains very few observations (18 in the summary table below), so its estimate is unreliable and its income gap is not statistically significant.

## `summarise()` ungrouping output (override with `.groups` argument)
marijuana num.obs mean.income sd.income se.income
no 3949 45425 28659 456
yes 966 44652 28745 925
missing 18 39961 19254 4538
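The per-group gap and its confidence interval, which are the quantities drawn as bars and error bars, can be computed as in the sketch below. It runs on simulated data in base R: the `toy` data frame, the `group` column and the helper logic are illustrative assumptions, not the original plotting code.

```r
# Simulated data: a built-in gap of 10000 between men and women
set.seed(1)
n <- 400
toy <- data.frame(
  sex   = factor(sample(c("Male", "Female"), n, replace = TRUE),
                 levels = c("Male", "Female")),
  group = factor(sample(c("no", "yes"), n, replace = TRUE))
)
toy$income <- 40000 + 10000 * (toy$sex == "Male") + rnorm(n, sd = 15000)

# One Welch t-test per group; the gap is mean(Male) - mean(Female)
gap_by_group <- do.call(rbind, lapply(split(toy, toy$group), function(d) {
  tt <- t.test(income ~ sex, data = d)
  data.frame(
    group       = as.character(d$group[1]),
    gap         = unname(tt$estimate[1] - tt$estimate[2]),
    ci.low      = tt$conf.int[1],
    ci.high     = tt$conf.int[2],
    # The gap is significant at the 5% level when the interval excludes 0
    significant = tt$conf.int[1] > 0 | tt$conf.int[2] < 0
  )
}))

gap_by_group
```

Checking whether one group's error bar contains another group's bar height is a rough visual comparison; a formal comparison of the gaps would use an interaction term in a regression.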

3. marital.status

We expected divorced individuals to earn more because they may have to work harder to make ends meet. Contrary to this expectation, the box plot shows that the median income is highest for married individuals, possibly because stable household conditions go together with doing well at work. To check whether the income gap between men and women is statistically significant within the different marital.status groups, we run a t.test in each group and visualize the result with confidence-interval error bars in a bar graph. The bar graph shows three things: i) The income gap between men and women is statistically insignificant only for widowed and separated individuals, since their error bars contain the null hypothesis value of 0. ii) Only the separated group has a negative bar, indicating that women earn more than men in that group. iii) No error bar contains the height of another group's bar, so the income gaps differ from one marital-status group to another. Finally, the missing category has a low count, so we do not include it in our analysis.

## `summarise()` ungrouping output (override with `.groups` argument)
marital.status num.obs mean.income sd.income se.income
Divorced 506 40632 26422 1175
Married 2386 51038 29830 611
missing 30 37857 21206 3872
Never married 1890 39793 26412 608
Separated 104 38600 27296 2677
Widowed 17 31706 23105 5604
## `summarise()` ungrouping output (override with `.groups` argument)

4. highest.degree

We used a density plot to visualize the income distribution for each level of educational attainment. Each distribution has a long right tail, indicating that income is right-skewed within every degree level. We also ran an ANOVA test to check whether the highest degree obtained affects income; the p-value (<2e-16) indicates that average income varies across degree levels.

##                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## highest.degree    8 6.027e+11 7.533e+10   107.7 <2e-16 ***
## Residuals      4924 3.445e+12 6.996e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
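The F value in this table follows from the sums of squares and degrees of freedom. The short check below uses the rounded figures printed above, so it matches the reported F value only up to rounding; the variable names are assumptions.

```r
# Figures copied from the ANOVA table above
ss_between <- 6.027e11; df_between <- 8
ss_resid   <- 3.445e12; df_resid   <- 4924

# F = (between-group mean square) / (residual mean square)
f_value <- (ss_between / df_between) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

round(f_value, 1)
```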
5. hard.times

From the box plot, we can observe that the median income is higher for individuals who have not experienced hard times. To check whether the income gap between men and women is statistically significant among individuals who have experienced hard.times and among those who have not, we run a t-test in each group and visualize the result with confidence-interval error bars in a bar graph. The bar graph shows three things: i) The income gap between men and women is statistically significant both for those who have experienced hard times and for those who have not, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women in both groups. iii) Since the error bar of the yes group contains the height of the bar of the no group, the income gap for those who experienced hard times is not statistically different from the gap for those who did not.

## `summarise()` ungrouping output (override with `.groups` argument)
hard.times num.obs mean.income sd.income se.income
missing 550 44311 29457 1256
no 4155 45877 28779 446
yes 228 36160 21953 1454
## `summarise()` ungrouping output (override with `.groups` argument)

6. work.limitation

From the box plot, we can observe that the median income is higher for individuals who do not have any work limitations. To check whether the income gap between men and women is statistically significant among individuals with and without work limitations, we run a t-test in each group and visualize the result with confidence-interval error bars in a bar graph. The bar graph shows three things: i) The income gap between men and women is statistically significant both for those who have work limitations and for those who do not, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women in both groups. iii) Since the error bar of the yes group contains the height of the bar of the no group and vice versa, the income gap for those who have work limitations is not statistically different from the gap for those who do not.

## `summarise()` ungrouping output (override with `.groups` argument)
work.limitation num.obs mean.income sd.income se.income
missing 547 44379 29524 1262
no 4139 45909 28631 445
yes 247 36212 25254 1607
## `summarise()` ungrouping output (override with `.groups` argument)

7. mother.edu

From the box plot, we can observe that the median income is highest for individuals whose mother's education is 1ST GRADE. However, 1ST GRADE and UNGRADED have only 4 observations each, so a t-test on these groups would not be reliable; the box for 1ST GRADE is also very spread out for the same reason. We therefore dropped 1ST GRADE and UNGRADED from the statistical significance test. To check whether the income gap between men and women is statistically significant within each level of mother's education, we run a t-test in each group and visualize the result with confidence-interval error bars in a bar graph. The bar graph shows three things: i) The income gap between men and women is statistically significant at every level of mother's education except 4TH GRADE, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars (excluding 4TH GRADE) are evidence that men earn more than women at each level of mother's education. iii) Since the error bar of the 2ND GRADE group contains the heights of the bars of all other groups, the income gap is not statistically different across levels of mother's education.

## `summarise()` ungrouping output (override with `.groups` argument)
mother.edu num.obs mean.income sd.income se.income
missing 342 39556 26393 1427
1ST GRADE 4 69250 50022 25011
2ND GRADE 15 35667 13937 3599
3RD GRADE 27 48330 31144 5994
4TH GRADE 32 38385 25667 4537
5TH GRADE 21 36810 22164 4837
6TH GRADE 123 40285 24965 2251
7TH GRADE 35 34357 20013 3383
8TH GRADE 97 35498 23955 2432
9TH GRADE 135 38437 25801 2221
10TH GRADE 212 36890 24699 1696
11TH GRADE 281 39335 24870 1484
12TH GRADE 1647 44024 27395 675
1ST YEAR COLLEGE 378 47374 29625 1524
2ND YEAR COLLEGE 579 45789 28683 1192
3RD YEAR COLLEGE 150 45774 28735 2346
4TH YEAR COLLEGE 548 54867 31410 1342
5TH YEAR COLLEGE 106 60845 32423 3149
6TH YEAR COLLEGE 132 57160 32803 2855
7TH YEAR COLLEGE 19 53837 28887 6627
8TH YEAR COLLEGE 46 64443 33716 4971
UNGRADED 4 21000 14071 7036
## `summarise()` ungrouping output (override with `.groups` argument)

8. citizenship

Contrary to our expectation, we see that the median income is higher for individuals with an unknown birthplace than for the rest. To check whether the income gap between men and women is statistically significant within each citizenship group, we run a t-test in each group and visualize the result with confidence-interval error bars in a bar graph. The bar graph shows three things: i) The income gap between men and women is statistically significant in every citizenship group, since the error bars do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women in every citizenship group. iii) Since the error bar of any one group contains the heights of the bars of the other groups, the income gap is not statistically different across citizenship groups.

## `summarise()` ungrouping output (override with `.groups` argument)
citizenship num.obs mean.income sd.income se.income
missing 522 44887 29779 1303
unknown.birthplace 407 47843 29062 1441
Unknown.not.us.born 128 48306 29488 2606
US.born.citizen 3876 44930 28410 456

9. total.incarcerations

We were interested in exploring the correlation between income and total incarcerations within each gender. To achieve this, we created a summary table which shows that for both men and women, this correlation is negative. Upon plotting the smooth line for the linear regression model, we deduced that for the same number of incarcerations, men on average have a higher income compared to women. Moreover, the line for females suggests that the highest number of incarcerations for women is 6, whereas for men it goes beyond 10.

## `summarise()` ungrouping output (override with `.groups` argument)
sex cor_inc_incarc
Male -0.1844140
Female -0.0932244
## `geom_smooth()` using formula 'y ~ x'
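A within-gender correlation of this kind can be computed by splitting the data by sex and correlating the two columns in each part. The sketch below uses simulated data in base R; the data frame `toy` and its values are made up, and the original summary table was presumably built with summarise() instead.

```r
# Simulated data with a built-in negative relationship
set.seed(42)
n <- 300
toy <- data.frame(
  sex                  = rep(c("Male", "Female"), each = n / 2),
  total.incarcerations = rpois(n, lambda = 0.5)
)
toy$income <- 45000 - 8000 * toy$total.incarcerations + rnorm(n, sd = 15000)

# Pearson correlation between income and incarcerations within each sex
cor_by_sex <- sapply(split(toy, toy$sex), function(d) {
  cor(d$income, d$total.incarcerations)
})

cor_by_sex
```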

10. hard.drugs

As part of our exploratory analysis, we compared the average income of men and women among users and non-users of hard drugs. Average income is lower among users of hard drugs for both genders, although this comparison alone does not establish that drug use causes the lower income. Within both groups, users and non-users, men have a higher average income than women. Interestingly, as seen in the summary table, the difference associated with hard drug use is larger for women: their average income is $3,017 lower among users, compared with only $1,045 lower for men. Moreover, as seen in the bar graph, within each group of users and non-users of hard drugs, the average income is statistically different between men and women.

11. race

We wanted to explore how the income gap between men and women varies within each race for those who have experienced hard times and those who have not. The interpretation is as follows: i) For the Hispanic and Non-Black/Non-Hispanic races, the income gap between men and women is statistically significant for both groups: those who have experienced hard times and those who have not. However, the difference in the income gap between the two groups is not statistically significant. ii) For the Black race, the income gap between men and women is statistically significant only for individuals who have not experienced hard times. Interestingly, among individuals who have experienced hard times, women earn more than men in this race, but this difference is not statistically significant. iii) For the Mixed Race, none of the respondents reported experiencing hard times, and among those who did not, the difference in income between men and women is not statistically significant.

12. emp_industry

To explore the industry type variable, we looked at the average income and how it differs between males and females within each industry type. Although no statistical tests were performed on this data, our exploratory analysis suggests that males earn more on average than females in all industry types except the following: ACS SPECIAL CODES. This gives us an overall sense that the income gap between the two genders prevails regardless of industry.

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

13. father.edu

From the box plot, we can observe that the median income is highest for individuals whose father's education is 7TH YEAR COLLEGE. 1ST GRADE has only one observation, so a t-test is not possible for that group and no box is drawn for it. UNGRADED shows a condensed box despite having only 3 observations, which we believe could be a coincidence. To check whether the income gap between men and women is statistically significant within each level of father's education, we run a t-test in each group and visualize the result with confidence-interval error bars in a bar graph. The bar graph shows three things: i) The income gap between men and women is statistically significant at every level of father's education except 3RD GRADE, 4TH GRADE, 5TH GRADE, 6TH GRADE, and 4TH YEAR COLLEGE, since the error bars of the remaining levels do not contain the null hypothesis value of 0. ii) The positive bars are evidence that men earn more than women at each level of father's education. iii) Since the error bars of the different groups contain the heights of the bars of the other groups, the income gap is not statistically different across levels of father's education.

## `summarise()` ungrouping output (override with `.groups` argument)
father.edu num.obs mean.income sd.income se.income
missing 1704 39785 26139 633
1ST GRADE 1 52000 NA NA
2ND GRADE 6 31000 17481 7137
3RD GRADE 27 36519 30294 5830
4TH GRADE 23 38042 22378 4666
5TH GRADE 26 35351 18368 3602
6TH GRADE 91 43074 22499 2359
7TH GRADE 41 41768 26433 4128
8TH GRADE 53 41369 28669 3938
9TH GRADE 105 41050 29188 2848
10TH GRADE 108 41668 26810 2580
11TH GRADE 149 38365 25833 2116
12TH GRADE 1089 46231 27561 835
1ST YEAR COLLEGE 196 47442 29610 2115
2ND YEAR COLLEGE 401 50625 30393 1518
3RD YEAR COLLEGE 102 51488 29575 2928
4TH YEAR COLLEGE 450 53596 31173 1470
5TH YEAR COLLEGE 73 53986 29263 3425
6TH YEAR COLLEGE 147 57313 34087 2811
7TH YEAR COLLEGE 56 65700 34145 4563
8TH YEAR COLLEGE 82 56027 33774 3730
UNGRADED 3 39333 5508 3180
## `summarise()` ungrouping output (override with `.groups` argument)

4. Findings

1. Linear Regression Model

## 
## Call:
## lm(formula = income ~ sex, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51135 -20137  -4659  13863 105841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51136.7      559.2   91.44   <2e-16 ***
## sexFemale   -11977.6      797.9  -15.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28020 on 4931 degrees of freedom
## Multiple R-squared:  0.0437, Adjusted R-squared:  0.04351 
## F-statistic: 225.3 on 1 and 4931 DF,  p-value: < 2.2e-16

We assume a 95% confidence level, i.e. alpha = 0.05, when comparing p-values to decide whether to reject the null hypothesis.

First, we ran the basic linear regression model of income against sex, assuming that no other variables affect income. Male is the reference level, so its effect is absorbed in the intercept: the average male income is 51136.7, while the average female income is 51136.7 + (1)(-11977.6) = 39159.1. Both the intercept and the sexFemale coefficient are statistically significant (p-value < 2e-16). The R^2 value is 0.0437, indicating that sex alone explains only a small share of the variation in income. Therefore, there is a need to add more variables from the dataset.
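With a single two-level factor as predictor, the fitted coefficients are just the group means in another form. The sketch below shows this on a tiny made-up data frame and then applies the same arithmetic to the coefficients reported above; the `toy` data and variable names are illustrative assumptions.

```r
# Made-up data with Male as the reference level
toy <- data.frame(
  sex    = factor(c("Male", "Male", "Male", "Female", "Female", "Female"),
                  levels = c("Male", "Female")),
  income = c(50000, 60000, 40000, 30000, 45000, 36000)
)

fit <- lm(income ~ sex, data = toy)
group_means <- tapply(toy$income, toy$sex, mean)

# Intercept = mean of the reference group (Male);
# intercept + sexFemale = mean of the Female group.
coef(fit)
group_means

# Same arithmetic with the coefficients from the regression output above
intercept   <- 51136.7
coef_female <- -11977.6
female_mean <- intercept + coef_female
female_mean
```

The resulting female mean agrees with the Female group mean from the earlier Welch t-test output.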

lm.basic <- lm(income ~ sex, data = renamed_nlsy)  # baseline model summarised above
lm.add.race <- lm(income ~ sex + race, data = renamed_nlsy)
anova(lm.basic, lm.add.race)
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + race
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4931 3.8704e+12                                   
## 2   4928 3.7318e+12  3 1.3863e+11 61.025 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.race)
## 
## Call:
## lm(formula = income ~ sex + race, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55177 -19327  -4233  14673 108767 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    42641.5      891.1  47.852  < 2e-16 ***
## sexFemale                     -11408.1      784.9 -14.535  < 2e-16 ***
## raceHispanic                    6480.3     1160.8   5.583  2.5e-08 ***
## raceMixed Race (Non-Hispanic)  11993.9     4090.4   2.932  0.00338 ** 
## raceNon-Black / Non-Hispanic   12685.7      953.1  13.310  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27520 on 4928 degrees of freedom
## Multiple R-squared:  0.07796,    Adjusted R-squared:  0.07721 
## F-statistic: 104.2 on 4 and 4928 DF,  p-value: < 2.2e-16

The second linear regression models income against both sex and race. First, we run the anova test on the models with and without race to check whether adding race is statistically significant. The p-value from this test is less than 2.2e-16, indicating high statistical significance, so the race variable is needed in the analysis. The adjusted R-squared value stands at 0.07721, showing again that the 2nd model explains only a small part of the variability in income.
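The F statistic of this nested-model comparison can be recomputed from the residual sums of squares in the anova table. The check below uses the rounded RSS values printed above, so it agrees with the reported F only approximately; the variable names are assumptions.

```r
# Residual sums of squares and residual degrees of freedom from the table
rss_small <- 3.8704e12; df_small <- 4931   # income ~ sex
rss_big   <- 3.7318e12; df_big   <- 4928   # income ~ sex + race

# Partial F test: reduction in RSS per added parameter,
# relative to the residual mean square of the larger model
f_value <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
p_value <- pf(f_value, df_small - df_big, df_big, lower.tail = FALSE)

round(f_value, 2)
```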

lm.add.race.interact <- lm(income ~ sex + race + sex*race, data = renamed_nlsy)
anova(lm.add.race, lm.add.race.interact)
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race
## Model 2: income ~ sex + race + sex * race
##   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)    
## 1   4928 3.7318e+12                                  
## 2   4925 3.7174e+12  3 1.4341e+10 6.3332 0.000278 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.race.interact)
## 
## Call:
## lm(formula = income ~ sex + race + sex * race, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56052 -19202  -4458  13798 106018 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                39458       1152  34.259  < 2e-16
## sexFemale                                  -5475       1572  -3.483 0.000501
## raceHispanic                               10719       1659   6.460 1.15e-10
## raceMixed Race (Non-Hispanic)              16782       5614   2.989 0.002810
## raceNon-Black / Non-Hispanic               16744       1368  12.242  < 2e-16
## sexFemale:raceHispanic                     -8085       2320  -3.485 0.000496
## sexFemale:raceMixed Race (Non-Hispanic)    -9361       8184  -1.144 0.252759
## sexFemale:raceNon-Black / Non-Hispanic     -7791       1905  -4.090 4.38e-05
##                                            
## (Intercept)                             ***
## sexFemale                               ***
## raceHispanic                            ***
## raceMixed Race (Non-Hispanic)           ** 
## raceNon-Black / Non-Hispanic            ***
## sexFemale:raceHispanic                  ***
## sexFemale:raceMixed Race (Non-Hispanic)    
## sexFemale:raceNon-Black / Non-Hispanic  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27470 on 4925 degrees of freedom
## Multiple R-squared:  0.0815, Adjusted R-squared:  0.08019 
## F-statistic: 62.43 on 7 and 4925 DF,  p-value: < 2.2e-16

The next model follows up on the second model, which included race, by adding the interaction between race and sex. Earlier in the paper (see: interaction terms) we discussed the need for an interaction term between sex and race. The anova test confirms that need: the p-value of 0.000278 is statistically significant. The adjusted R-squared rises to 0.08019.
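To make the interaction terms concrete, the sketch below adds the sexFemale coefficient to each sexFemale:race coefficient to get the estimated female-minus-male income difference within each race group. The numbers are copied from the summary(lm.add.race.interact) output above. The label for the omitted race level is an assumption (we take it to be "Black", the category not listed among the coefficients), and the vector names are illustrative only.

```r
# Coefficients copied from summary(lm.add.race.interact)
b_female <- -5475
b_interact <- c("Hispanic"                  = -8085,
                "Mixed Race (Non-Hispanic)" = -9361,
                "Non-Black / Non-Hispanic"  = -7791)

# Assumption: the omitted (reference) race level is "Black".
# For the reference level the gap is the sexFemale coefficient alone;
# for every other level it is sexFemale plus the matching interaction term.
gap_by_race <- c("Black (assumed reference)" = b_female,
                 b_female + b_interact)
print(gap_by_race)
```

The Mixed Race (Non-Hispanic) interaction is not statistically significant (p = 0.25), so the gap computed for that group is estimated imprecisely.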

lm.add.tot.incarc <- lm(income ~ sex + race + sex*race + total.incarcerations, data = renamed_nlsy)
anova(lm.add.race.interact, lm.add.tot.incarc)
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race + sex * race
## Model 2: income ~ sex + race + sex * race + total.incarcerations
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4925 3.7174e+12                                   
## 2   4924 3.6393e+12  1 7.8112e+10 105.69 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.tot.incarc)
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations, 
##     data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57422 -18974  -4184  13395 105816 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              41681.0     1160.0  35.930  < 2e-16
## sexFemale                                -7497.2     1568.2  -4.781 1.80e-06
## raceHispanic                             10107.9     1643.1   6.152 8.26e-10
## raceMixed Race (Non-Hispanic)            15956.8     5556.0   2.872 0.004096
## raceNon-Black / Non-Hispanic             15891.5     1356.0  11.720  < 2e-16
## total.incarcerations                     -6989.1      679.8 -10.280  < 2e-16
## sexFemale:raceHispanic                   -7317.8     2296.8  -3.186 0.001451
## sexFemale:raceMixed Race (Non-Hispanic)  -8419.0     8098.6  -1.040 0.298595
## sexFemale:raceNon-Black / Non-Hispanic   -6800.0     1887.3  -3.603 0.000318
##                                            
## (Intercept)                             ***
## sexFemale                               ***
## raceHispanic                            ***
## raceMixed Race (Non-Hispanic)           ** 
## raceNon-Black / Non-Hispanic            ***
## total.incarcerations                    ***
## sexFemale:raceHispanic                  ** 
## sexFemale:raceMixed Race (Non-Hispanic)    
## sexFemale:raceNon-Black / Non-Hispanic  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27190 on 4924 degrees of freedom
## Multiple R-squared:  0.1008, Adjusted R-squared:  0.09934 
## F-statistic:    69 on 8 and 4924 DF,  p-value: < 2.2e-16

In this regression model we added the total.incarcerations variable and ran an anova test to check its statistical significance. The p-value is below 2.2e-16, suggesting high statistical significance. The adjusted R-squared with total.incarcerations in the model is 0.0993, up from 0.0802 for the previous model. This increase suggests that adding total.incarcerations lets the model explain more of the variability in income.
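As a quick check on how the adjusted R-squared relates to the multiple R-squared, the sketch below applies the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1) to the values printed in the two summaries above. The helper name adj_r2 is ours, and the sample size is inferred from the reported degrees of freedom rather than read from the data.

```r
# Adjusted R-squared from multiple R-squared, sample size n and
# number of estimated slopes p (intercept not counted)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n inferred from the output: residual df + slopes + intercept
n <- 4924 + 8 + 1   # 4933 observations

adj_interact <- adj_r2(0.0815, n, 7)   # lm.add.race.interact
adj_incarc   <- adj_r2(0.1008, n, 8)   # lm.add.tot.incarc
print(round(c(interaction = adj_interact, incarcerations = adj_incarc), 4))
```

Because the penalty term (n - 1)/(n - p - 1) is very close to 1 at this sample size, the adjusted and multiple R-squared values stay close to each other throughout the analysis.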

lm.add.marital.status <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status, data = renamed_nlsy)
anova(lm.add.tot.incarc, lm.add.marital.status)
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race + sex * race + total.incarcerations
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4924 3.6393e+12                                   
## 2   4919 3.5511e+12  5 8.8208e+10 24.437 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.marital.status)
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations + 
##     marital.status, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61057 -19015  -4365  13576 108292 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              39356.3     1644.0  23.940  < 2e-16
## sexFemale                                -6691.8     1551.7  -4.313 1.64e-05
## raceHispanic                              9132.3     1627.3   5.612 2.11e-08
## raceMixed Race (Non-Hispanic)            14705.3     5493.4   2.677 0.007455
## raceNon-Black / Non-Hispanic             13964.7     1352.9  10.322  < 2e-16
## total.incarcerations                     -6241.8      676.1  -9.232  < 2e-16
## marital.statusMarried                     7935.8     1321.1   6.007 2.03e-09
## marital.statusmissing                    -2669.6     5053.8  -0.528 0.597354
## marital.statusNever married               -956.0     1357.5  -0.704 0.481306
## marital.statusSeparated                   -739.3     2899.1  -0.255 0.798729
## marital.statusWidowed                    -3085.5     6636.1  -0.465 0.641988
## sexFemale:raceHispanic                   -8505.0     2274.8  -3.739 0.000187
## sexFemale:raceMixed Race (Non-Hispanic)  -8494.0     8004.0  -1.061 0.288640
## sexFemale:raceNon-Black / Non-Hispanic   -7764.4     1869.3  -4.154 3.33e-05
##                                            
## (Intercept)                             ***
## sexFemale                               ***
## raceHispanic                            ***
## raceMixed Race (Non-Hispanic)           ** 
## raceNon-Black / Non-Hispanic            ***
## total.incarcerations                    ***
## marital.statusMarried                   ***
## marital.statusmissing                      
## marital.statusNever married                
## marital.statusSeparated                    
## marital.statusWidowed                      
## sexFemale:raceHispanic                  ***
## sexFemale:raceMixed Race (Non-Hispanic)    
## sexFemale:raceNon-Black / Non-Hispanic  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26870 on 4919 degrees of freedom
## Multiple R-squared:  0.1226, Adjusted R-squared:  0.1203 
## F-statistic: 52.87 on 13 and 4919 DF,  p-value: < 2.2e-16

In this regression model we added the variable marital.status and ran an anova test to check its statistical significance. The p-value for the anova test comparing the models with and without marital.status is below 2.2e-16, suggesting statistical significance. The adjusted R-squared of 0.1203 is an increase over the 0.0993 of the model without marital.status, so the model now explains more of the variability in income.
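The F statistic that anova reports for two nested models can be reproduced from the table itself. The sketch below does this for the marital.status comparison, using the values printed in the anova table above; the variable names are ours.

```r
# Values copied from anova(lm.add.tot.incarc, lm.add.marital.status)
sum_sq_added <- 8.8208e10   # drop in RSS when marital.status is added
df_added     <- 5           # extra coefficients in the larger model
rss_big      <- 3.5511e12   # RSS of the larger model
df_resid_big <- 4919        # residual df of the larger model

# F = (drop in RSS per added coefficient) / (residual variance of larger model)
f_stat <- (sum_sq_added / df_added) / (rss_big / df_resid_big)

# Upper-tail probability of the F distribution gives the p-value
p_val <- pf(f_stat, df_added, df_resid_big, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```

The result agrees with the reported F of 24.437 up to rounding of the printed inputs, and the p-value falls below the 2.2e-16 floor that R prints.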

lm.add.hard.times <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + hard.times, data = renamed_nlsy)
anova(lm.add.marital.status, lm.add.hard.times)
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     hard.times
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4919 3.5511e+12                                   
## 2   4917 3.5402e+12  2 1.0922e+10 7.5852 0.0005139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.hard.times) 
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations + 
##     marital.status + hard.times, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61274 -18914  -4334  13526 107971 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              39624.9     1981.4  19.998  < 2e-16
## sexFemale                                -6885.3     1550.9  -4.440 9.21e-06
## raceHispanic                              9037.6     1626.7   5.556 2.91e-08
## raceMixed Race (Non-Hispanic)            14484.9     5487.9   2.639 0.008331
## raceNon-Black / Non-Hispanic             13652.3     1353.5  10.087  < 2e-16
## total.incarcerations                     -6081.9      676.6  -8.989  < 2e-16
## marital.statusMarried                     7972.4     1319.4   6.043 1.63e-09
## marital.statusmissing                    -2547.0     5047.2  -0.505 0.613834
## marital.statusNever married               -934.4     1355.7  -0.689 0.490708
## marital.statusSeparated                   -607.7     2896.0  -0.210 0.833802
## marital.statusWidowed                    -3056.5     6629.2  -0.461 0.644766
## hard.timesno                               224.0     1224.6   0.183 0.854899
## hard.timesyes                            -6915.6     2119.2  -3.263 0.001109
## sexFemale:raceHispanic                   -8274.2     2272.6  -3.641 0.000274
## sexFemale:raceMixed Race (Non-Hispanic)  -8219.2     7993.8  -1.028 0.303908
## sexFemale:raceNon-Black / Non-Hispanic   -7490.5     1868.5  -4.009 6.19e-05
##                                            
## (Intercept)                             ***
## sexFemale                               ***
## raceHispanic                            ***
## raceMixed Race (Non-Hispanic)           ** 
## raceNon-Black / Non-Hispanic            ***
## total.incarcerations                    ***
## marital.statusMarried                   ***
## marital.statusmissing                      
## marital.statusNever married                
## marital.statusSeparated                    
## marital.statusWidowed                      
## hard.timesno                               
## hard.timesyes                           ** 
## sexFemale:raceHispanic                  ***
## sexFemale:raceMixed Race (Non-Hispanic)    
## sexFemale:raceNon-Black / Non-Hispanic  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26830 on 4917 degrees of freedom
## Multiple R-squared:  0.1253, Adjusted R-squared:  0.1226 
## F-statistic: 46.95 on 15 and 4917 DF,  p-value: < 2.2e-16

In this step we add the hard.times variable to the regression model and run an anova test to check the statistical significance of adding it. The p-value of 0.0005139 suggests statistical significance. The adjusted R-squared is 0.1226, again an increase over the 0.1203 of the model without hard.times, suggesting that more of the variability in income is now accounted for.
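The p-value in an anova table is the upper tail of an F distribution, so it can be recomputed from the reported F statistic and degrees of freedom. The short sketch below does this for the hard.times comparison with the values printed above; the variable names are ours.

```r
# Values copied from anova(lm.add.marital.status, lm.add.hard.times)
f_hard   <- 7.5852
df_num   <- 2      # coefficients added by hard.times
df_denom <- 4917   # residual df of the larger model

p_hard <- pf(f_hard, df_num, df_denom, lower.tail = FALSE)
print(signif(p_hard, 4))
```

Recomputing the value this way is a cheap guard against transcription slips when p-values are quoted in the text.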

lm.add.work.lim <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + hard.times + work.limitation, data = renamed_nlsy)
anova(lm.add.hard.times, lm.add.work.lim)
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     hard.times
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     hard.times + work.limitation
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4917 3.5402e+12                                   
## 2   4915 3.5165e+12  2 2.3677e+10 16.547 6.884e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.work.lim) 
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations + 
##     marital.status + hard.times + work.limitation, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61895 -18447  -4366  13487 107580 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              39702.51    1978.17  20.070  < 2e-16
## sexFemale                                -6865.25    1546.20  -4.440 9.19e-06
## raceHispanic                              9164.09    1622.14   5.649 1.70e-08
## raceMixed Race (Non-Hispanic)            14227.55    5470.85   2.601 0.009334
## raceNon-Black / Non-Hispanic             14067.14    1351.17  10.411  < 2e-16
## total.incarcerations                     -5951.61     675.00  -8.817  < 2e-16
## marital.statusMarried                     7596.14    1316.90   5.768 8.50e-09
## marital.statusmissing                    -2975.30    5031.87  -0.591 0.554353
## marital.statusNever married              -1146.50    1351.96  -0.848 0.396466
## marital.statusSeparated                   -768.66    2887.64  -0.266 0.790105
## marital.statusWidowed                    -3157.48    6608.33  -0.478 0.632812
## hard.timesno                               713.49    8101.03   0.088 0.929821
## hard.timesyes                            -6059.94    8217.16  -0.737 0.460870
## work.limitationno                           16.07    8118.59   0.002 0.998421
## work.limitationyes                      -10117.61    8267.84  -1.224 0.221113
## sexFemale:raceHispanic                   -8441.76    2265.63  -3.726 0.000197
## sexFemale:raceMixed Race (Non-Hispanic)  -7738.79    7969.10  -0.971 0.331546
## sexFemale:raceNon-Black / Non-Hispanic   -7586.76    1862.88  -4.073 4.72e-05
##                                            
## (Intercept)                             ***
## sexFemale                               ***
## raceHispanic                            ***
## raceMixed Race (Non-Hispanic)           ** 
## raceNon-Black / Non-Hispanic            ***
## total.incarcerations                    ***
## marital.statusMarried                   ***
## marital.statusmissing                      
## marital.statusNever married                
## marital.statusSeparated                    
## marital.statusWidowed                      
## hard.timesno                               
## hard.timesyes                              
## work.limitationno                          
## work.limitationyes                         
## sexFemale:raceHispanic                  ***
## sexFemale:raceMixed Race (Non-Hispanic)    
## sexFemale:raceNon-Black / Non-Hispanic  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26750 on 4915 degrees of freedom
## Multiple R-squared:  0.1311, Adjusted R-squared:  0.1281 
## F-statistic: 43.64 on 17 and 4915 DF,  p-value: < 2.2e-16

In this step we added another variable, work.limitation, to the regression model. The p-value for the anova test comparing the models with and without work.limitation is 6.884e-08, suggesting statistical significance. The adjusted R-squared is 0.1281, an increase over the 0.1226 of the model without work.limitation. This again suggests that the model explains progressively more of the variability in income as more relevant variables are added. Note, however, that none of the individual hard.times and work.limitation coefficients is significant in this model, and their standard errors rise to roughly 8,100 to 8,300 (the hard.timesno standard error was about 1,225 in the previous model), so the two variables appear to carry overlapping information.
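In the summary above, the standard errors of the hard.times coefficients jump from about 1,200 to 2,100 in the previous model to about 8,100 once work.limitation is added. One possible explanation, which we have not verified against the NLSY97 data, is that the two variables share an almost identical set of "missing" respondents. The sketch below uses purely simulated data (the variables a, b and y are hypothetical stand-ins, not the survey variables) to show how such an overlap inflates standard errors.

```r
set.seed(42)
n <- 2000

# Hypothetical: about 5% of respondents skip both questions
skipped <- runif(n) < 0.05
a <- ifelse(skipped, "missing", sample(c("no", "yes"), n, replace = TRUE))
b <- ifelse(skipped, "missing", sample(c("no", "yes"), n, replace = TRUE))

# A few extra rows are missing on b only, so the overlap is not exact
b[sample(which(!skipped), 3)] <- "missing"

# Simulated outcome that depends on a only
y <- 40000 - 7000 * (a == "yes") + rnorm(n, sd = 27000)
sim <- data.frame(y, a, b)

# Standard error of one coefficient from a fitted model
se <- function(fit, term) coef(summary(fit))[term, "Std. Error"]

se_alone <- se(lm(y ~ a, data = sim), "ayes")
se_both  <- se(lm(y ~ a + b, data = sim), "ayes")
print(c(alone = se_alone, both = se_both))
```

On the real data, a cross-tabulation such as table(renamed_nlsy$hard.times, renamed_nlsy$work.limitation) would show whether the missing levels actually coincide.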

lm.add.highest.degree <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + marital.status + work.limitation + highest.degree, data = renamed_nlsy)
anova(lm.add.work.lim, lm.add.highest.degree)
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     hard.times + work.limitation
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     marital.status + work.limitation + highest.degree
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4915 3.5165e+12                                   
## 2   4909 3.0241e+12  6 4.9239e+11 133.22 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.highest.degree)
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations + 
##     marital.status + marital.status + work.limitation + highest.degree, 
##     data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79974 -16861  -3488  12887 110464 
## 
## Coefficients:
##                                                 Estimate Std. Error t value
## (Intercept)                                      49536.6     2248.9  22.027
## sexFemale                                        -8902.7     1438.3  -6.190
## raceHispanic                                      9324.9     1507.6   6.185
## raceMixed Race (Non-Hispanic)                     5054.3     5089.7   0.993
## raceNon-Black / Non-Hispanic                     11067.3     1258.4   8.795
## total.incarcerations                             -3566.6      639.4  -5.578
## marital.statusMarried                             3167.8     1233.1   2.569
## marital.statusmissing                            -3860.1     4671.8  -0.826
## marital.statusNever married                      -3596.5     1258.8  -2.857
## marital.statusSeparated                           -636.3     2682.6  -0.237
## marital.statusWidowed                            -2121.4     6138.8  -0.346
## work.limitationno                                -1610.9     1139.3  -1.414
## work.limitationyes                               -8622.1     1919.3  -4.492
## highest.degreeBachelor's degree (BA, BS)          9854.1     1496.8   6.584
## highest.degreeGED                               -13794.9     1714.0  -8.048
## highest.degreeHigh school diploma 12 year        -8283.1     1400.4  -5.915
## highest.degreeMaster's degree (MA, MS)           18525.8     2002.8   9.250
## highest.degreemissing                            -2452.1     1935.4  -1.267
## highest.degreeNone                              -18919.7     1966.3  -9.622
## highest.degreePhD                                37889.6     7290.4   5.197
## highest.degreeProfessional degree (DDS, JD, MD)  38889.8     4871.5   7.983
## sexFemale:raceHispanic                           -7534.0     2103.0  -3.583
## sexFemale:raceMixed Race (Non-Hispanic)            945.5     7405.3   0.128
## sexFemale:raceNon-Black / Non-Hispanic           -8674.3     1730.3  -5.013
##                                                 Pr(>|t|)    
## (Intercept)                                      < 2e-16 ***
## sexFemale                                       6.52e-10 ***
## raceHispanic                                    6.71e-10 ***
## raceMixed Race (Non-Hispanic)                   0.320742    
## raceNon-Black / Non-Hispanic                     < 2e-16 ***
## total.incarcerations                            2.56e-08 ***
## marital.statusMarried                           0.010226 *  
## marital.statusmissing                           0.408690    
## marital.statusNever married                     0.004293 ** 
## marital.statusSeparated                         0.812530    
## marital.statusWidowed                           0.729677    
## work.limitationno                               0.157465    
## work.limitationyes                              7.21e-06 ***
## highest.degreeBachelor's degree (BA, BS)        5.07e-11 ***
## highest.degreeGED                               1.04e-15 ***
## highest.degreeHigh school diploma 12 year       3.55e-09 ***
## highest.degreeMaster's degree (MA, MS)           < 2e-16 ***
## highest.degreemissing                           0.205217    
## highest.degreeNone                               < 2e-16 ***
## highest.degreePhD                               2.11e-07 ***
## highest.degreeProfessional degree (DDS, JD, MD) 1.76e-15 ***
## sexFemale:raceHispanic                          0.000344 ***
## sexFemale:raceMixed Race (Non-Hispanic)         0.898412    
## sexFemale:raceNon-Black / Non-Hispanic          5.54e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24820 on 4909 degrees of freedom
## Multiple R-squared:  0.2528, Adjusted R-squared:  0.2493 
## F-statistic: 72.21 on 23 and 4909 DF,  p-value: < 2.2e-16

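Next we add the variable emp_industry_2011 to the model. The model it extends, lm.add.highest.degree, adds highest.degree and reaches an adjusted R-squared of 0.2493, up from 0.1281 for lm.add.work.lim, with an anova p-value below 2.2e-16. Note that the lm.add.highest.degree formula lists marital.status twice and omits hard.times, so lm.add.work.lim is not nested in it and that comparison should be read with caution. To find out whether emp_industry_2011 is statistically significant, we run an anova test comparing the models with and without this variable. The p-value for this test is below 2.2e-16, suggesting very high statistical significance. The residual sum of squares falls from 3.0241e+12 to 2.9076e+12, so more of the variability in income is explained once this variable is included. This is plausible, because the industry of employment is likely to be related to income.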
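The anova F-test assumes that the smaller model is nested in the larger one. The lm.add.highest.degree formula above lists marital.status twice and has no hard.times term, so a quick way to see what actually changed between two formulas is to compare their term labels. The sketch below needs no data; the helper name term_labels is ours.

```r
# Term labels of a formula (duplicates are dropped by terms())
term_labels <- function(f) attr(terms(f), "term.labels")

f_work_lim <- income ~ sex + race + sex*race + total.incarcerations +
  marital.status + hard.times + work.limitation
f_highest_degree <- income ~ sex + race + sex*race + total.incarcerations +
  marital.status + marital.status + work.limitation + highest.degree

# Terms present in the earlier model but absent from the later one
dropped <- setdiff(term_labels(f_work_lim), term_labels(f_highest_degree))
# Terms that are new in the later model
added <- setdiff(term_labels(f_highest_degree), term_labels(f_work_lim))

print(list(dropped = dropped, added = added))
```

A non-empty dropped set means the two models are not nested, which is why the reported 6 degrees of freedom equal the 8 highest.degree coefficients minus the 2 hard.times coefficients.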

lm.add.emp.ind <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + marital.status + work.limitation + highest.degree + emp_industry_2011, data = renamed_nlsy)
anova(lm.add.highest.degree, lm.add.emp.ind)
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     marital.status + work.limitation + highest.degree
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     marital.status + work.limitation + highest.degree + emp_industry_2011
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4909 3.0241e+12                                   
## 2   4892 2.9076e+12 17 1.1651e+11 11.531 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.emp.ind)
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations + 
##     marital.status + marital.status + work.limitation + highest.degree + 
##     emp_industry_2011, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73833 -16149  -3133  12629 106146 
## 
## Coefficients:
##                                                                   Estimate
## (Intercept)                                                       46157.85
## sexFemale                                                         -6806.61
## raceHispanic                                                       8988.70
## raceMixed Race (Non-Hispanic)                                      6649.36
## raceNon-Black / Non-Hispanic                                      10385.39
## total.incarcerations                                              -3279.25
## marital.statusMarried                                              3090.45
## marital.statusmissing                                             -1495.51
## marital.statusNever married                                       -3216.59
## marital.statusSeparated                                             -39.13
## marital.statusWidowed                                             -1817.19
## work.limitationno                                                 -1274.25
## work.limitationyes                                                -8292.08
## highest.degreeBachelor's degree (BA, BS)                           9392.17
## highest.degreeGED                                                -13427.66
## highest.degreeHigh school diploma 12 year                         -8142.26
## highest.degreeMaster's degree (MA, MS)                            18352.54
## highest.degreemissing                                              2559.04
## highest.degreeNone                                               -17978.55
## highest.degreePhD                                                 39730.44
## highest.degreeProfessional degree (DDS, JD, MD)                   37282.71
## emp_industry_2011ACTIVE DUTY MILITARY                             14374.04
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES               3377.00
## emp_industry_2011CONSTRUCTION                                      7195.99
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         -1006.62
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -3396.45
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE               7339.02
## emp_industry_2011INFORMATION AND COMMUNICATION                     6301.95
## emp_industry_2011MANUFACTURING                                     7508.35
## emp_industry_2011MINING                                           25475.80
## emp_industry_2011missing                                          -3351.72
## emp_industry_2011OTHER SERVICES                                   -3481.04
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                 5175.87
## emp_industry_2011PUBLIC ADMINISTRATION                            10326.64
## emp_industry_2011RETAIL TRADE                                      -868.36
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                    7317.14
## emp_industry_2011UTILITIES                                        21433.31
## emp_industry_2011WHOLESALE TRADE                                   5910.18
## sexFemale:raceHispanic                                            -7432.29
## sexFemale:raceMixed Race (Non-Hispanic)                             691.75
## sexFemale:raceNon-Black / Non-Hispanic                            -8106.49
##                                                                  Std. Error
## (Intercept)                                                         8900.58
## sexFemale                                                           1437.85
## raceHispanic                                                        1483.80
## raceMixed Race (Non-Hispanic)                                       5008.97
## raceNon-Black / Non-Hispanic                                        1244.27
## total.incarcerations                                                 632.77
## marital.statusMarried                                               1215.52
## marital.statusmissing                                               4602.01
## marital.statusNever married                                         1240.79
## marital.statusSeparated                                             2636.98
## marital.statusWidowed                                               6033.17
## work.limitationno                                                   1120.76
## work.limitationyes                                                  1887.05
## highest.degreeBachelor's degree (BA, BS)                            1476.15
## highest.degreeGED                                                   1693.43
## highest.degreeHigh school diploma 12 year                           1383.08
## highest.degreeMaster's degree (MA, MS)                              1984.83
## highest.degreemissing                                               2112.57
## highest.degreeNone                                                  1949.51
## highest.degreePhD                                                   7170.89
## highest.degreeProfessional degree (DDS, JD, MD)                     4802.70
## emp_industry_2011ACTIVE DUTY MILITARY                              10386.07
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES                9656.47
## emp_industry_2011CONSTRUCTION                                       8782.80
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES           8670.15
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES    8708.19
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE                8747.52
## emp_industry_2011INFORMATION AND COMMUNICATION                      8939.55
## emp_industry_2011MANUFACTURING                                      8745.28
## emp_industry_2011MINING                                             9834.23
## emp_industry_2011missing                                            8702.65
## emp_industry_2011OTHER SERVICES                                     8797.64
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                  8699.90
## emp_industry_2011PUBLIC ADMINISTRATION                              8809.51
## emp_industry_2011RETAIL TRADE                                       8708.54
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                     8887.03
## emp_industry_2011UTILITIES                                         10021.98
## emp_industry_2011WHOLESALE TRADE                                    8930.62
## sexFemale:raceHispanic                                              2069.54
## sexFemale:raceMixed Race (Non-Hispanic)                             7279.02
## sexFemale:raceNon-Black / Non-Hispanic                              1706.04
##                                                                  t value
## (Intercept)                                                        5.186
## sexFemale                                                         -4.734
## raceHispanic                                                       6.058
## raceMixed Race (Non-Hispanic)                                      1.327
## raceNon-Black / Non-Hispanic                                       8.347
## total.incarcerations                                              -5.182
## marital.statusMarried                                              2.542
## marital.statusmissing                                             -0.325
## marital.statusNever married                                       -2.592
## marital.statusSeparated                                           -0.015
## marital.statusWidowed                                             -0.301
## work.limitationno                                                 -1.137
## work.limitationyes                                                -4.394
## highest.degreeBachelor's degree (BA, BS)                           6.363
## highest.degreeGED                                                 -7.929
## highest.degreeHigh school diploma 12 year                         -5.887
## highest.degreeMaster's degree (MA, MS)                             9.246
## highest.degreemissing                                              1.211
## highest.degreeNone                                                -9.222
## highest.degreePhD                                                  5.541
## highest.degreeProfessional degree (DDS, JD, MD)                    7.763
## emp_industry_2011ACTIVE DUTY MILITARY                              1.384
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES               0.350
## emp_industry_2011CONSTRUCTION                                      0.819
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         -0.116
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -0.390
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE               0.839
## emp_industry_2011INFORMATION AND COMMUNICATION                     0.705
## emp_industry_2011MANUFACTURING                                     0.859
## emp_industry_2011MINING                                            2.591
## emp_industry_2011missing                                          -0.385
## emp_industry_2011OTHER SERVICES                                   -0.396
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                 0.595
## emp_industry_2011PUBLIC ADMINISTRATION                             1.172
## emp_industry_2011RETAIL TRADE                                     -0.100
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                    0.823
## emp_industry_2011UTILITIES                                         2.139
## emp_industry_2011WHOLESALE TRADE                                   0.662
## sexFemale:raceHispanic                                            -3.591
## sexFemale:raceMixed Race (Non-Hispanic)                            0.095
## sexFemale:raceNon-Black / Non-Hispanic                            -4.752
##                                                                  Pr(>|t|)    
## (Intercept)                                                      2.24e-07 ***
## sexFemale                                                        2.26e-06 ***
## raceHispanic                                                     1.48e-09 ***
## raceMixed Race (Non-Hispanic)                                    0.184409    
## raceNon-Black / Non-Hispanic                                      < 2e-16 ***
## total.incarcerations                                             2.28e-07 ***
## marital.statusMarried                                            0.011037 *  
## marital.statusmissing                                            0.745219    
## marital.statusNever married                                      0.009560 ** 
## marital.statusSeparated                                          0.988161    
## marital.statusWidowed                                            0.763275    
## work.limitationno                                                0.255614    
## work.limitationyes                                               1.14e-05 ***
## highest.degreeBachelor's degree (BA, BS)                         2.16e-10 ***
## highest.degreeGED                                                2.71e-15 ***
## highest.degreeHigh school diploma 12 year                        4.19e-09 ***
## highest.degreeMaster's degree (MA, MS)                            < 2e-16 ***
## highest.degreemissing                                            0.225823    
## highest.degreeNone                                                < 2e-16 ***
## highest.degreePhD                                                3.17e-08 ***
## highest.degreeProfessional degree (DDS, JD, MD)                  1.00e-14 ***
## emp_industry_2011ACTIVE DUTY MILITARY                            0.166430    
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES             0.726569    
## emp_industry_2011CONSTRUCTION                                    0.412639    
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES        0.907576    
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES 0.696532    
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE             0.401520    
## emp_industry_2011INFORMATION AND COMMUNICATION                   0.480874    
## emp_industry_2011MANUFACTURING                                   0.390625    
## emp_industry_2011MINING                                          0.009611 ** 
## emp_industry_2011missing                                         0.700152    
## emp_industry_2011OTHER SERVICES                                  0.692359    
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES               0.551915    
## emp_industry_2011PUBLIC ADMINISTRATION                           0.241168    
## emp_industry_2011RETAIL TRADE                                    0.920576    
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                  0.410349    
## emp_industry_2011UTILITIES                                       0.032515 *  
## emp_industry_2011WHOLESALE TRADE                                 0.508138    
## sexFemale:raceHispanic                                           0.000332 ***
## sexFemale:raceMixed Race (Non-Hispanic)                          0.924292    
## sexFemale:raceNon-Black / Non-Hispanic                           2.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24380 on 4892 degrees of freedom
## Multiple R-squared:  0.2816, Adjusted R-squared:  0.2757 
## F-statistic: 47.94 on 40 and 4892 DF,  p-value: < 2.2e-16

Next we decided to check the variable highest.degree because, intuitively, it also seems to affect income. To find out whether its contribution was statistically significant, we ran an anova test comparing the models with and without this variable. The p-value for this test came out to be 2.2e-16, suggesting very high statistical significance. The adjusted R-squared of 0.2757 means that the fitted model as a whole explains about 28% of the variability in income; it is not the share explained by highest.degree alone.
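The adjusted R-squared quoted above can be reproduced from the other figures printed at the bottom of the summary output. The short sketch below is only a cross-check, not part of the original analysis; the variable names (r2, df_model, df_resid) are ours and the inputs are the rounded values copied from the printout.

```r
# Values copied from the summary() output above (rounded as printed)
r2       <- 0.2816   # Multiple R-squared
df_model <- 40       # numerator df of the overall F-statistic
df_resid <- 4892     # residual degrees of freedom

# Observations used in the fit: residual df + slope coefficients + intercept
n <- df_resid + df_model + 1

# Adjusted R-squared penalizes R-squared for the number of estimated coefficients
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)
```

Because the penalty factor (n - 1) / df_resid is close to 1 with almost 5,000 observations, the adjusted and unadjusted values differ only in the third decimal place.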

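# Add citizenship to the previous model. R drops the repeated marital.status term and the
# sex + race main effects already implied by sex*race, so the fit is unaffected by the duplicates.
lm.add.citizenship <- lm(income ~ sex + race + sex*race + total.incarcerations + marital.status + marital.status + work.limitation + highest.degree + emp_industry_2011 + citizenship, data = renamed_nlsy)
# F-test on the nested models: does adding citizenship significantly reduce the residual sum of squares?
anova(lm.add.emp.ind, lm.add.citizenship)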
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     marital.status + work.limitation + highest.degree + emp_industry_2011
## Model 2: income ~ sex + race + sex * race + total.incarcerations + marital.status + 
##     marital.status + work.limitation + highest.degree + emp_industry_2011 + 
##     citizenship
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4892 2.9076e+12                                   
## 2   4889 2.8959e+12  3 1.1703e+10 6.5857 0.0001941 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.add.citizenship)
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations + 
##     marital.status + marital.status + work.limitation + highest.degree + 
##     emp_industry_2011 + citizenship, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73713 -16025  -3103  12843 106607 
## 
## Coefficients:
##                                                                   Estimate
## (Intercept)                                                       47391.08
## sexFemale                                                         -6727.94
## raceHispanic                                                       7154.85
## raceMixed Race (Non-Hispanic)                                      6449.02
## raceNon-Black / Non-Hispanic                                      10422.50
## total.incarcerations                                              -3212.12
## marital.statusMarried                                              3252.41
## marital.statusmissing                                             -1817.89
## marital.statusNever married                                       -3156.73
## marital.statusSeparated                                             -94.99
## marital.statusWidowed                                             -1758.43
## work.limitationno                                                  5591.76
## work.limitationyes                                                -1340.46
## highest.degreeBachelor's degree (BA, BS)                           9436.58
## highest.degreeGED                                                -13179.92
## highest.degreeHigh school diploma 12 year                         -7967.04
## highest.degreeMaster's degree (MA, MS)                            18418.95
## highest.degreemissing                                              2521.14
## highest.degreeNone                                               -18042.32
## highest.degreePhD                                                 39296.52
## highest.degreeProfessional degree (DDS, JD, MD)                   37439.44
## emp_industry_2011ACTIVE DUTY MILITARY                             13871.21
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES               2901.86
## emp_industry_2011CONSTRUCTION                                      6759.10
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         -1671.89
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -3994.65
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE               6730.40
## emp_industry_2011INFORMATION AND COMMUNICATION                     5665.48
## emp_industry_2011MANUFACTURING                                     6906.92
## emp_industry_2011MINING                                           25441.24
## emp_industry_2011missing                                          -3887.45
## emp_industry_2011OTHER SERVICES                                   -4234.86
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                 4492.57
## emp_industry_2011PUBLIC ADMINISTRATION                             9817.80
## emp_industry_2011RETAIL TRADE                                     -1428.69
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                    6797.22
## emp_industry_2011UTILITIES                                        20796.20
## emp_industry_2011WHOLESALE TRADE                                   5338.84
## citizenshipunknown.birthplace                                     -2237.33
## citizenshipUnknown.not.us.born                                    -4336.43
## citizenshipUS.born.citizen                                        -8033.69
## sexFemale:raceHispanic                                            -7575.78
## sexFemale:raceMixed Race (Non-Hispanic)                             589.36
## sexFemale:raceNon-Black / Non-Hispanic                            -8128.30
##                                                                  Std. Error
## (Intercept)                                                         8890.11
## sexFemale                                                           1435.56
## raceHispanic                                                        1551.43
## raceMixed Race (Non-Hispanic)                                       5003.73
## raceNon-Black / Non-Hispanic                                        1242.39
## total.incarcerations                                                 631.95
## marital.statusMarried                                               1214.53
## marital.statusmissing                                               4594.98
## marital.statusNever married                                         1238.87
## marital.statusSeparated                                             2632.72
## marital.statusWidowed                                               6024.00
## work.limitationno                                                   4898.87
## work.limitationyes                                                  5123.18
## highest.degreeBachelor's degree (BA, BS)                            1474.27
## highest.degreeGED                                                   1691.50
## highest.degreeHigh school diploma 12 year                           1381.68
## highest.degreeMaster's degree (MA, MS)                              1982.47
## highest.degreemissing                                               2109.94
## highest.degreeNone                                                  1947.26
## highest.degreePhD                                                   7161.69
## highest.degreeProfessional degree (DDS, JD, MD)                     4794.69
## emp_industry_2011ACTIVE DUTY MILITARY                              10369.15
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES                9640.91
## emp_industry_2011CONSTRUCTION                                       8769.19
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES           8657.39
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES    8694.93
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE                8734.48
## emp_industry_2011INFORMATION AND COMMUNICATION                      8925.98
## emp_industry_2011MANUFACTURING                                      8731.63
## emp_industry_2011MINING                                             9817.50
## emp_industry_2011missing                                            8689.41
## emp_industry_2011OTHER SERVICES                                     8784.72
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                  8687.17
## emp_industry_2011PUBLIC ADMINISTRATION                              8795.53
## emp_industry_2011RETAIL TRADE                                       8695.51
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                     8872.84
## emp_industry_2011UTILITIES                                         10009.73
## emp_industry_2011WHOLESALE TRADE                                    8916.91
## citizenshipunknown.birthplace                                       5137.17
## citizenshipUnknown.not.us.born                                      5484.89
## citizenshipUS.born.citizen                                          5002.79
## sexFemale:raceHispanic                                              2067.36
## sexFemale:raceMixed Race (Non-Hispanic)                             7268.97
## sexFemale:raceNon-Black / Non-Hispanic                              1703.33
##                                                                  t value
## (Intercept)                                                        5.331
## sexFemale                                                         -4.687
## raceHispanic                                                       4.612
## raceMixed Race (Non-Hispanic)                                      1.289
## raceNon-Black / Non-Hispanic                                       8.389
## total.incarcerations                                              -5.083
## marital.statusMarried                                              2.678
## marital.statusmissing                                             -0.396
## marital.statusNever married                                       -2.548
## marital.statusSeparated                                           -0.036
## marital.statusWidowed                                             -0.292
## work.limitationno                                                  1.141
## work.limitationyes                                                -0.262
## highest.degreeBachelor's degree (BA, BS)                           6.401
## highest.degreeGED                                                 -7.792
## highest.degreeHigh school diploma 12 year                         -5.766
## highest.degreeMaster's degree (MA, MS)                             9.291
## highest.degreemissing                                              1.195
## highest.degreeNone                                                -9.265
## highest.degreePhD                                                  5.487
## highest.degreeProfessional degree (DDS, JD, MD)                    7.809
## emp_industry_2011ACTIVE DUTY MILITARY                              1.338
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES               0.301
## emp_industry_2011CONSTRUCTION                                      0.771
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         -0.193
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -0.459
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE               0.771
## emp_industry_2011INFORMATION AND COMMUNICATION                     0.635
## emp_industry_2011MANUFACTURING                                     0.791
## emp_industry_2011MINING                                            2.591
## emp_industry_2011missing                                          -0.447
## emp_industry_2011OTHER SERVICES                                   -0.482
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                 0.517
## emp_industry_2011PUBLIC ADMINISTRATION                             1.116
## emp_industry_2011RETAIL TRADE                                     -0.164
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                    0.766
## emp_industry_2011UTILITIES                                         2.078
## emp_industry_2011WHOLESALE TRADE                                   0.599
## citizenshipunknown.birthplace                                     -0.436
## citizenshipUnknown.not.us.born                                    -0.791
## citizenshipUS.born.citizen                                        -1.606
## sexFemale:raceHispanic                                            -3.664
## sexFemale:raceMixed Race (Non-Hispanic)                            0.081
## sexFemale:raceNon-Black / Non-Hispanic                            -4.772
##                                                                  Pr(>|t|)    
## (Intercept)                                                      1.02e-07 ***
## sexFemale                                                        2.85e-06 ***
## raceHispanic                                                     4.09e-06 ***
## raceMixed Race (Non-Hispanic)                                     0.19751    
## raceNon-Black / Non-Hispanic                                      < 2e-16 ***
## total.incarcerations                                             3.86e-07 ***
## marital.statusMarried                                             0.00743 ** 
## marital.statusmissing                                             0.69240    
## marital.statusNever married                                       0.01086 *  
## marital.statusSeparated                                           0.97122    
## marital.statusWidowed                                             0.77037    
## work.limitationno                                                 0.25374    
## work.limitationyes                                                0.79361    
## highest.degreeBachelor's degree (BA, BS)                         1.69e-10 ***
## highest.degreeGED                                                8.01e-15 ***
## highest.degreeHigh school diploma 12 year                        8.61e-09 ***
## highest.degreeMaster's degree (MA, MS)                            < 2e-16 ***
## highest.degreemissing                                             0.23219    
## highest.degreeNone                                                < 2e-16 ***
## highest.degreePhD                                                4.29e-08 ***
## highest.degreeProfessional degree (DDS, JD, MD)                  7.03e-15 ***
## emp_industry_2011ACTIVE DUTY MILITARY                             0.18104    
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES              0.76343    
## emp_industry_2011CONSTRUCTION                                     0.44088    
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         0.84687    
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  0.64595    
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE              0.44101    
## emp_industry_2011INFORMATION AND COMMUNICATION                    0.52564    
## emp_industry_2011MANUFACTURING                                    0.42897    
## emp_industry_2011MINING                                           0.00959 ** 
## emp_industry_2011missing                                          0.65462    
## emp_industry_2011OTHER SERVICES                                   0.62978    
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                0.60507    
## emp_industry_2011PUBLIC ADMINISTRATION                            0.26438    
## emp_industry_2011RETAIL TRADE                                     0.86950    
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                   0.44367    
## emp_industry_2011UTILITIES                                        0.03780 *  
## emp_industry_2011WHOLESALE TRADE                                  0.54938    
## citizenshipunknown.birthplace                                     0.66320    
## citizenshipUnknown.not.us.born                                    0.42921    
## citizenshipUS.born.citizen                                        0.10837    
## sexFemale:raceHispanic                                            0.00025 ***
## sexFemale:raceMixed Race (Non-Hispanic)                           0.93538    
## sexFemale:raceNon-Black / Non-Hispanic                           1.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24340 on 4889 degrees of freedom
## Multiple R-squared:  0.2845, Adjusted R-squared:  0.2782 
## F-statistic: 45.21 on 43 and 4889 DF,  p-value: < 2.2e-16

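We also decided to check the variable citizenship in case it was associated with statistically significant differences in income. The anova test comparing the models with and without citizenship gives a p-value of 0.0001941 (F = 6.5857 on 3 degrees of freedom), which is significant at the 0.001 level, although none of the individual citizenship coefficients is significant at the 0.05 level. The adjusted R-squared rises only slightly, from 0.2757 to 0.2782, so citizenship adds little explanatory power on top of the other variables; the full model explains about 28% of the variability in income.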
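As a sanity check on the anova table for citizenship, the F statistic and its p-value can be recomputed by hand from the printed sums of squares. This is a minimal sketch using the rounded values from the table, so the results should agree with the printed F of 6.5857 and Pr(>F) of 0.0001941 only up to rounding; the variable names are ours.

```r
# Values copied from the anova table above (rounded as printed)
sum_sq  <- 1.1703e10   # drop in RSS when citizenship is added (Sum of Sq)
df_diff <- 3           # extra coefficients in Model 2 (Df)
rss_big <- 2.8959e12   # residual sum of squares of Model 2
df_big  <- 4889        # residual degrees of freedom of Model 2

# F = (reduction in RSS per added coefficient) / (residual variance of the larger model)
f_stat <- (sum_sq / df_diff) / (rss_big / df_big)

# Upper-tail probability of the F distribution with (df_diff, df_big) degrees of freedom
p_val <- pf(f_stat, df_diff, df_big, lower.tail = FALSE)

f_stat
p_val
```

The denominator is the estimated residual variance of the larger model, which is why the test is only valid when the two models are nested and fitted to the same rows of data.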

lm.final <- lm.add.citizenship
summary(lm.final)
## 
## Call:
## lm(formula = income ~ sex + race + sex * race + total.incarcerations + 
##     marital.status + marital.status + work.limitation + highest.degree + 
##     emp_industry_2011 + citizenship, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73713 -16025  -3103  12843 106607 
## 
## Coefficients:
##                                                                   Estimate
## (Intercept)                                                       47391.08
## sexFemale                                                         -6727.94
## raceHispanic                                                       7154.85
## raceMixed Race (Non-Hispanic)                                      6449.02
## raceNon-Black / Non-Hispanic                                      10422.50
## total.incarcerations                                              -3212.12
## marital.statusMarried                                              3252.41
## marital.statusmissing                                             -1817.89
## marital.statusNever married                                       -3156.73
## marital.statusSeparated                                             -94.99
## marital.statusWidowed                                             -1758.43
## work.limitationno                                                  5591.76
## work.limitationyes                                                -1340.46
## highest.degreeBachelor's degree (BA, BS)                           9436.58
## highest.degreeGED                                                -13179.92
## highest.degreeHigh school diploma 12 year                         -7967.04
## highest.degreeMaster's degree (MA, MS)                            18418.95
## highest.degreemissing                                              2521.14
## highest.degreeNone                                               -18042.32
## highest.degreePhD                                                 39296.52
## highest.degreeProfessional degree (DDS, JD, MD)                   37439.44
## emp_industry_2011ACTIVE DUTY MILITARY                             13871.21
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES               2901.86
## emp_industry_2011CONSTRUCTION                                      6759.10
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         -1671.89
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -3994.65
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE               6730.40
## emp_industry_2011INFORMATION AND COMMUNICATION                     5665.48
## emp_industry_2011MANUFACTURING                                     6906.92
## emp_industry_2011MINING                                           25441.24
## emp_industry_2011missing                                          -3887.45
## emp_industry_2011OTHER SERVICES                                   -4234.86
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                 4492.57
## emp_industry_2011PUBLIC ADMINISTRATION                             9817.80
## emp_industry_2011RETAIL TRADE                                     -1428.69
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                    6797.22
## emp_industry_2011UTILITIES                                        20796.20
## emp_industry_2011WHOLESALE TRADE                                   5338.84
## citizenshipunknown.birthplace                                     -2237.33
## citizenshipUnknown.not.us.born                                    -4336.43
## citizenshipUS.born.citizen                                        -8033.69
## sexFemale:raceHispanic                                            -7575.78
## sexFemale:raceMixed Race (Non-Hispanic)                             589.36
## sexFemale:raceNon-Black / Non-Hispanic                            -8128.30
##                                                                  Std. Error
## (Intercept)                                                         8890.11
## sexFemale                                                           1435.56
## raceHispanic                                                        1551.43
## raceMixed Race (Non-Hispanic)                                       5003.73
## raceNon-Black / Non-Hispanic                                        1242.39
## total.incarcerations                                                 631.95
## marital.statusMarried                                               1214.53
## marital.statusmissing                                               4594.98
## marital.statusNever married                                         1238.87
## marital.statusSeparated                                             2632.72
## marital.statusWidowed                                               6024.00
## work.limitationno                                                   4898.87
## work.limitationyes                                                  5123.18
## highest.degreeBachelor's degree (BA, BS)                            1474.27
## highest.degreeGED                                                   1691.50
## highest.degreeHigh school diploma 12 year                           1381.68
## highest.degreeMaster's degree (MA, MS)                              1982.47
## highest.degreemissing                                               2109.94
## highest.degreeNone                                                  1947.26
## highest.degreePhD                                                   7161.69
## highest.degreeProfessional degree (DDS, JD, MD)                     4794.69
## emp_industry_2011ACTIVE DUTY MILITARY                              10369.15
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES                9640.91
## emp_industry_2011CONSTRUCTION                                       8769.19
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES           8657.39
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES    8694.93
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE                8734.48
## emp_industry_2011INFORMATION AND COMMUNICATION                      8925.98
## emp_industry_2011MANUFACTURING                                      8731.63
## emp_industry_2011MINING                                             9817.50
## emp_industry_2011missing                                            8689.41
## emp_industry_2011OTHER SERVICES                                     8784.72
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                  8687.17
## emp_industry_2011PUBLIC ADMINISTRATION                              8795.53
## emp_industry_2011RETAIL TRADE                                       8695.51
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                     8872.84
## emp_industry_2011UTILITIES                                         10009.73
## emp_industry_2011WHOLESALE TRADE                                    8916.91
## citizenshipunknown.birthplace                                       5137.17
## citizenshipUnknown.not.us.born                                      5484.89
## citizenshipUS.born.citizen                                          5002.79
## sexFemale:raceHispanic                                              2067.36
## sexFemale:raceMixed Race (Non-Hispanic)                             7268.97
## sexFemale:raceNon-Black / Non-Hispanic                              1703.33
##                                                                  t value
## (Intercept)                                                        5.331
## sexFemale                                                         -4.687
## raceHispanic                                                       4.612
## raceMixed Race (Non-Hispanic)                                      1.289
## raceNon-Black / Non-Hispanic                                       8.389
## total.incarcerations                                              -5.083
## marital.statusMarried                                              2.678
## marital.statusmissing                                             -0.396
## marital.statusNever married                                       -2.548
## marital.statusSeparated                                           -0.036
## marital.statusWidowed                                             -0.292
## work.limitationno                                                  1.141
## work.limitationyes                                                -0.262
## highest.degreeBachelor's degree (BA, BS)                           6.401
## highest.degreeGED                                                 -7.792
## highest.degreeHigh school diploma 12 year                         -5.766
## highest.degreeMaster's degree (MA, MS)                             9.291
## highest.degreemissing                                              1.195
## highest.degreeNone                                                -9.265
## highest.degreePhD                                                  5.487
## highest.degreeProfessional degree (DDS, JD, MD)                    7.809
## emp_industry_2011ACTIVE DUTY MILITARY                              1.338
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES               0.301
## emp_industry_2011CONSTRUCTION                                      0.771
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         -0.193
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  -0.459
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE               0.771
## emp_industry_2011INFORMATION AND COMMUNICATION                     0.635
## emp_industry_2011MANUFACTURING                                     0.791
## emp_industry_2011MINING                                            2.591
## emp_industry_2011missing                                          -0.447
## emp_industry_2011OTHER SERVICES                                   -0.482
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                 0.517
## emp_industry_2011PUBLIC ADMINISTRATION                             1.116
## emp_industry_2011RETAIL TRADE                                     -0.164
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                    0.766
## emp_industry_2011UTILITIES                                         2.078
## emp_industry_2011WHOLESALE TRADE                                   0.599
## citizenshipunknown.birthplace                                     -0.436
## citizenshipUnknown.not.us.born                                    -0.791
## citizenshipUS.born.citizen                                        -1.606
## sexFemale:raceHispanic                                            -3.664
## sexFemale:raceMixed Race (Non-Hispanic)                            0.081
## sexFemale:raceNon-Black / Non-Hispanic                            -4.772
##                                                                  Pr(>|t|)    
## (Intercept)                                                      1.02e-07 ***
## sexFemale                                                        2.85e-06 ***
## raceHispanic                                                     4.09e-06 ***
## raceMixed Race (Non-Hispanic)                                     0.19751    
## raceNon-Black / Non-Hispanic                                      < 2e-16 ***
## total.incarcerations                                             3.86e-07 ***
## marital.statusMarried                                             0.00743 ** 
## marital.statusmissing                                             0.69240    
## marital.statusNever married                                       0.01086 *  
## marital.statusSeparated                                           0.97122    
## marital.statusWidowed                                             0.77037    
## work.limitationno                                                 0.25374    
## work.limitationyes                                                0.79361    
## highest.degreeBachelor's degree (BA, BS)                         1.69e-10 ***
## highest.degreeGED                                                8.01e-15 ***
## highest.degreeHigh school diploma 12 year                        8.61e-09 ***
## highest.degreeMaster's degree (MA, MS)                            < 2e-16 ***
## highest.degreemissing                                             0.23219    
## highest.degreeNone                                                < 2e-16 ***
## highest.degreePhD                                                4.29e-08 ***
## highest.degreeProfessional degree (DDS, JD, MD)                  7.03e-15 ***
## emp_industry_2011ACTIVE DUTY MILITARY                             0.18104    
## emp_industry_2011AGRICULTURE, FORESTRY AND FISHERIES              0.76343    
## emp_industry_2011CONSTRUCTION                                     0.44088    
## emp_industry_2011EDUCATIONAL, HEALTH, AND SOCIAL SERVICES         0.84687    
## emp_industry_2011ENTERTAINMENT, ACCOMODATIONS, AND FOOD SERVICES  0.64595    
## emp_industry_2011FINANCE, INSURANCE, AND REAL ESTATE              0.44101    
## emp_industry_2011INFORMATION AND COMMUNICATION                    0.52564    
## emp_industry_2011MANUFACTURING                                    0.42897    
## emp_industry_2011MINING                                           0.00959 ** 
## emp_industry_2011missing                                          0.65462    
## emp_industry_2011OTHER SERVICES                                   0.62978    
## emp_industry_2011PROFESSIONAL AND RELATED SERVICES                0.60507    
## emp_industry_2011PUBLIC ADMINISTRATION                            0.26438    
## emp_industry_2011RETAIL TRADE                                     0.86950    
## emp_industry_2011TRANSPORTATION AND WAREHOUSING                   0.44367    
## emp_industry_2011UTILITIES                                        0.03780 *  
## emp_industry_2011WHOLESALE TRADE                                  0.54938    
## citizenshipunknown.birthplace                                     0.66320    
## citizenshipUnknown.not.us.born                                    0.42921    
## citizenshipUS.born.citizen                                        0.10837    
## sexFemale:raceHispanic                                            0.00025 ***
## sexFemale:raceMixed Race (Non-Hispanic)                           0.93538    
## sexFemale:raceNon-Black / Non-Hispanic                           1.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24340 on 4889 degrees of freedom
## Multiple R-squared:  0.2845, Adjusted R-squared:  0.2782 
## F-statistic: 45.21 on 43 and 4889 DF,  p-value: < 2.2e-16

2. Collinearity

Checking for collinearity between hard.drugs and hard.times

We checked for collinearity between hard.drugs and hard.times because we felt they might capture a similar effect on income. Using hard drugs can itself be considered a form of experiencing hard times, so the two variables could be collinear.

The pairs plot for hard.drugs and hard.times gives cause for concern. In the top-right panel, the box for the pair hard.drugs = no, hard.times = no is considerably larger than the others. Box size represents the number of respondents in each combination, so most respondents answered no to both questions. This suggests that the two variables overlap heavily and that either one may capture much of the other's effect, making it unnecessary to include both in the regression. We take this as evidence of collinearity between hard.drugs and hard.times, which allows us to drop one of the two.
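
The visual comparison of box sizes can be complemented by a numeric check. The following is a minimal sketch, not the analysis run for this report: it uses simulated yes/no factors as a hypothetical stand-in for the hard.drugs and hard.times columns of renamed_nlsy (the real columns also carry a third baseline level), tabulates the cell counts, and measures the association with a chi-squared test and Cramér's V.

```r
# Hypothetical stand-in data: two simulated yes/no factors (not the NLSY97 values).
set.seed(1)
n <- 2000
hard.times <- factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1)),
                     levels = c("no", "yes"))
# Make hard.drugs depend on hard.times so the two are associated by construction.
p_yes <- ifelse(hard.times == "yes", 0.6, 0.05)
hard.drugs <- factor(ifelse(runif(n) < p_yes, "yes", "no"), levels = c("no", "yes"))

# Cell counts: the numeric counterpart of the box sizes in the pairs plot.
tab <- table(hard.drugs, hard.times)
print(tab)

# Pearson chi-squared test of independence (no continuity correction).
chi <- chisq.test(tab, correct = FALSE)
print(chi$p.value)

# Cramer's V rescales the chi-squared statistic to the range [0, 1].
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
print(cramers_v)
```

A large no/no cell on its own only shows that both answers are common. A small p-value together with a Cramér's V well above zero is what would indicate that the two factors move together.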

## 
## Call:
## lm(formula = income ~ sex + hard.times + hard.drugs, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52010 -20017  -4967  14983 104983 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    47099.8     2131.2  22.100  < 2e-16 ***
## sexFemale     -11994.9      795.8 -15.073  < 2e-16 ***
## hard.timesno     959.2     1271.9   0.754   0.4508    
## hard.timesyes  -8851.0     2201.5  -4.020  5.9e-05 ***
## hard.drugsno    3952.6     1899.7   2.081   0.0375 *  
## hard.drugsyes   1902.8     2463.8   0.772   0.4400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27930 on 4927 degrees of freedom
## Multiple R-squared:  0.05005,    Adjusted R-squared:  0.04909 
## F-statistic: 51.92 on 5 and 4927 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + hard.times
## Model 2: income ~ sex + hard.times + hard.drugs
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1   4929 3.8490e+12                               
## 2   4927 3.8447e+12  2 4329116365 2.7739 0.06252 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ sex + hard.drugs
## Model 2: income ~ sex + hard.times + hard.drugs
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)    
## 1   4929 3.8656e+12                                 
## 2   4927 3.8447e+12  2 2.0884e+10 13.382 1.6e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + hard.times, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51733 -19740  -4740  14265 105260 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      50548       1261  40.072  < 2e-16 ***
## sexFemale       -11994        796 -15.068  < 2e-16 ***
## hard.timesno      1187       1268   0.936    0.349    
## hard.timesyes    -8706       2201  -3.955 7.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27940 on 4929 degrees of freedom
## Multiple R-squared:  0.04898,    Adjusted R-squared:  0.0484 
## F-statistic: 84.62 on 3 and 4929 DF,  p-value: < 2.2e-16

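To decide which of hard.drugs and hard.times to drop, we used ANOVA F-tests comparing the model that contains both variables with the models that omit one of them. Adding hard.drugs to a model that already contains sex and hard.times does not significantly improve the fit (F = 2.77, p = 0.063), whereas adding hard.times to a model that already contains sex and hard.drugs does (F = 13.38, p = 1.6e-06). hard.times is therefore the more informative of the two, and hard.drugs was dropped.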
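
The comparison described above can be sketched as follows. This is an illustration on simulated data, assuming a hypothetical data frame `sim` whose income depends on sex and hard.times but not on hard.drugs; the column names mirror those in renamed_nlsy, but none of the values come from NLSY97.

```r
# Hypothetical simulated data; only the column names mirror renamed_nlsy.
set.seed(2)
n <- 1000
sim <- data.frame(
  sex        = factor(sample(c("Male", "Female"), n, replace = TRUE),
                      levels = c("Male", "Female")),
  hard.times = factor(sample(c("no", "yes"), n, replace = TRUE)),
  hard.drugs = factor(sample(c("no", "yes"), n, replace = TRUE))
)
# Income depends on sex and hard.times, but not on hard.drugs.
sim$income <- 50000 - 12000 * (sim$sex == "Female") -
  9000 * (sim$hard.times == "yes") + rnorm(n, sd = 20000)

full     <- lm(income ~ sex + hard.times + hard.drugs, data = sim)
no_drugs <- lm(income ~ sex + hard.times, data = sim)
no_times <- lm(income ~ sex + hard.drugs, data = sim)

# Partial F-tests: does the full model fit better than each reduced model?
drop_drugs <- anova(no_drugs, full)  # tests the contribution of hard.drugs
drop_times <- anova(no_times, full)  # tests the contribution of hard.times
print(drop_drugs)
print(drop_times)
```

Each call to `anova()` with two nested `lm` fits reports the F-statistic for the terms present only in the larger model, so the variable whose removal produces the larger, more significant F is the one worth keeping.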

3. Interaction Term
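
The output in this section repeats one pattern for each candidate variable: fit an additive model, fit a model that adds the interaction with sex, and compare the two with an F-test. A minimal sketch of that pattern is shown here on simulated data; the data frame `sim` and the factor `group` are hypothetical placeholders rather than NLSY97 variables.

```r
# Hypothetical simulated data with a sex gap that differs across groups.
set.seed(3)
n <- 1200
sim <- data.frame(
  sex   = factor(sample(c("Male", "Female"), n, replace = TRUE),
                 levels = c("Male", "Female")),
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
# The female gap is wider in group B than in groups A and C.
gap <- ifelse(sim$group == "B", -20000, -5000)
sim$income <- 45000 + gap * (sim$sex == "Female") + rnorm(n, sd = 15000)

additive <- lm(income ~ sex + group, data = sim)
with_int <- lm(income ~ sex * group, data = sim)  # expands to sex + group + sex:group

# F-test for the joint contribution of all sex:group terms.
cmp <- anova(additive, with_int)
print(cmp)

# Writing the main effects again next to sex * group does not change the fit.
redundant <- lm(income ~ sex + group + sex * group, data = sim)
print(all.equal(coef(with_int), coef(redundant)))
```

A significant F-test here means the income gap between men and women differs across levels of the second variable, which is the question each interaction model in this section asks.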

## 
## Call:
## lm(formula = income ~ sex, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51135 -20137  -4659  13863 105841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51136.7      559.2   91.44   <2e-16 ***
## sexFemale   -11977.6      797.9  -15.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28020 on 4931 degrees of freedom
## Multiple R-squared:  0.0437, Adjusted R-squared:  0.04351 
## F-statistic: 225.3 on 1 and 4931 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = income ~ sex + race, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55177 -19327  -4233  14673 108767 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    42641.5      891.1  47.852  < 2e-16 ***
## sexFemale                     -11408.1      784.9 -14.535  < 2e-16 ***
## raceHispanic                    6480.3     1160.8   5.583  2.5e-08 ***
## raceMixed Race (Non-Hispanic)  11993.9     4090.4   2.932  0.00338 ** 
## raceNon-Black / Non-Hispanic   12685.7      953.1  13.310  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27520 on 4928 degrees of freedom
## Multiple R-squared:  0.07796,    Adjusted R-squared:  0.07721 
## F-statistic: 104.2 on 4 and 4928 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + race
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4931 3.8704e+12                                   
## 2   4928 3.7318e+12  3 1.3863e+11 61.025 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + race + sex * race, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56052 -19202  -4458  13798 106018 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                39458       1152  34.259  < 2e-16
## sexFemale                                  -5475       1572  -3.483 0.000501
## raceHispanic                               10719       1659   6.460 1.15e-10
## raceMixed Race (Non-Hispanic)              16782       5614   2.989 0.002810
## raceNon-Black / Non-Hispanic               16744       1368  12.242  < 2e-16
## sexFemale:raceHispanic                     -8085       2320  -3.485 0.000496
## sexFemale:raceMixed Race (Non-Hispanic)    -9361       8184  -1.144 0.252759
## sexFemale:raceNon-Black / Non-Hispanic     -7791       1905  -4.090 4.38e-05
##                                            
## (Intercept)                             ***
## sexFemale                               ***
## raceHispanic                            ***
## raceMixed Race (Non-Hispanic)           ** 
## raceNon-Black / Non-Hispanic            ***
## sexFemale:raceHispanic                  ***
## sexFemale:raceMixed Race (Non-Hispanic)    
## sexFemale:raceNon-Black / Non-Hispanic  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27470 on 4925 degrees of freedom
## Multiple R-squared:  0.0815, Adjusted R-squared:  0.08019 
## F-statistic: 62.43 on 7 and 4925 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + race
## Model 2: income ~ sex + race + sex * race
##   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)    
## 1   4928 3.7318e+12                                  
## 2   4925 3.7174e+12  3 1.4341e+10 6.3332 0.000278 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + marital.status, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56702 -19491  -4914  14555 107624 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    47478       1300  36.513  < 2e-16 ***
## sexFemale                     -11988        784 -15.290  < 2e-16 ***
## marital.statusMarried           9423       1346   7.003 2.84e-12 ***
## marital.statusmissing          -4827       5162  -0.935    0.350    
## marital.statusNever married    -2034       1377  -1.477    0.140    
## marital.statusSeparated        -3115       2957  -1.053    0.292    
## marital.statusWidowed          -5900       6774  -0.871    0.384    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27460 on 4926 degrees of freedom
## Multiple R-squared:  0.08219,    Adjusted R-squared:  0.08108 
## F-statistic: 73.52 on 6 and 4926 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + marital.status
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4931 3.8704e+12                                   
## 2   4926 3.7146e+12  5 1.5579e+11 41.318 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + marital.status + sex * marital.status, 
##     data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59503 -19310  -4643  13836 106397 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              46581       1854  25.130  < 2e-16 ***
## sexFemale                               -10417       2453  -4.247 2.21e-05 ***
## marital.statusMarried                    13122       2012   6.522 7.62e-11 ***
## marital.statusmissing                    -2804       6698  -0.419   0.6755    
## marital.statusNever married              -3978       2045  -1.945   0.0518 .  
## marital.statusSeparated                 -10490       4152  -2.526   0.0116 *  
## marital.statusWidowed                     9752      15874   0.614   0.5390    
## sexFemale:marital.statusMarried          -7299       2696  -2.708   0.0068 ** 
## sexFemale:marital.statusmissing          -4386      10468  -0.419   0.6753    
## sexFemale:marital.statusNever married     4456       2757   1.617   0.1060    
## sexFemale:marital.statusSeparated        15636       5894   2.653   0.0080 ** 
## sexFemale:marital.statusWidowed         -19488      17544  -1.111   0.2667    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27310 on 4921 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09141 
## F-statistic: 46.11 on 11 and 4921 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + marital.status
## Model 2: income ~ sex + marital.status + sex * marital.status
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4926 3.7146e+12                                   
## 2   4921 3.6691e+12  5 4.5514e+10 12.209 8.765e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + hard.drugs
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)  
## 1   4931 3.8704e+12                             
## 2   4929 3.8656e+12  2 4.819e+09 3.0724 0.0464 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + hard.drugs + sex * hard.drugs, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51466 -19468  -4465  14532 105535 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                45744       2567  17.817   <2e-16 ***
## sexFemale                  -8869       3704  -2.394   0.0167 *  
## hard.drugsno                5724       2634   2.173   0.0299 *  
## hard.drugsyes               4678       3470   1.348   0.1776    
## sexFemale:hard.drugsno     -3134       3799  -0.825   0.4095    
## sexFemale:hard.drugsyes    -5106       4936  -1.034   0.3010    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28010 on 4927 degrees of freedom
## Multiple R-squared:  0.0451, Adjusted R-squared:  0.04413 
## F-statistic: 46.54 on 5 and 4927 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + hard.drugs
## Model 2: income ~ sex + hard.drugs + sex * hard.drugs
##   Res.Df        RSS Df Sum of Sq    F Pr(>F)
## 1   4929 3.8656e+12                         
## 2   4927 3.8647e+12  2 847072797 0.54 0.5828
##  [1] "total.incarcerations"  "marijuana"             "sex"                  
##  [4] "marital.status"        "high.school.diploma"   "parenthood.by.20"     
##  [7] "chance.college.degree" "hard.times"            "work.limitation"      
## [10] "citizenship"           "mother.edu"            "father.edu"           
## [13] "race"                  "hard.drugs"            "public.private"       
## [16] "highest.degree"        "emp_industry_2011"     "job.type"             
## [19] "income"
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + hard.times
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4931 3.8704e+12                                   
## 2   4929 3.8490e+12  2 2.1374e+10 13.686 1.183e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + hard.times + sex * hard.times, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51883 -19583  -4583  14475 105417 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              50524.6     1719.8  29.378  < 2e-16 ***
## sexFemale               -11949.9     2384.9  -5.011 5.62e-07 ***
## hard.timesno              1360.2     1823.5   0.746 0.455753    
## hard.timesyes           -11294.0     3076.5  -3.671 0.000244 ***
## sexFemale:hard.timesno    -352.2     2537.7  -0.139 0.889636    
## sexFemale:hard.timesyes   5467.6     4407.4   1.241 0.214828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27940 on 4927 degrees of freedom
## Multiple R-squared:  0.04943,    Adjusted R-squared:  0.04847 
## F-statistic: 51.25 on 5 and 4927 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + hard.times
## Model 2: income ~ sex + hard.times + sex * hard.times
##   Res.Df        RSS Df  Sum of Sq     F Pr(>F)
## 1   4929 3.8490e+12                           
## 2   4927 3.8472e+12  2 1825592360 1.169 0.3108
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + highest.degree
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4931 3.8704e+12                                   
## 2   4923 3.1779e+12  8 6.9246e+11 134.09 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + highest.degree + sex * highest.degree, 
##     data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77200 -17314  -3011  12818 108989 
## 
## Coefficients:
##                                                           Estimate Std. Error
## (Intercept)                                                53719.1     1936.7
## sexFemale                                                 -11243.5     2635.2
## highest.degreeBachelor's degree (BA, BS)                   13448.6     2257.1
## highest.degreeGED                                         -16339.6     2433.7
## highest.degreeHigh school diploma 12 year                  -6405.5     2078.5
## highest.degreeMaster's degree (MA, MS)                     21463.0     3257.8
## highest.degreemissing                                      -3474.9     2808.2
## highest.degreeNone                                        -18082.6     2776.4
## highest.degreePhD                                          39530.9    12846.3
## highest.degreeProfessional degree (DDS, JD, MD)            44270.4     7899.2
## sexFemale:highest.degreeBachelor's degree (BA, BS)         -4474.6     3058.2
## sexFemale:highest.degreeGED                                  148.1     3480.9
## sexFemale:highest.degreeHigh school diploma 12 year        -5058.7     2864.5
## sexFemale:highest.degreeMaster's degree (MA, MS)           -2910.0     4192.8
## sexFemale:highest.degreemissing                             1720.1     3948.5
## sexFemale:highest.degreeNone                               -6828.7     3992.7
## sexFemale:highest.degreePhD                                  493.5    15775.3
## sexFemale:highest.degreeProfessional degree (DDS, JD, MD)  -6046.1    10175.4
##                                                           t value Pr(>|t|)    
## (Intercept)                                                27.738  < 2e-16 ***
## sexFemale                                                  -4.267 2.02e-05 ***
## highest.degreeBachelor's degree (BA, BS)                    5.958 2.73e-09 ***
## highest.degreeGED                                          -6.714 2.11e-11 ***
## highest.degreeHigh school diploma 12 year                  -3.082  0.00207 ** 
## highest.degreeMaster's degree (MA, MS)                      6.588 4.92e-11 ***
## highest.degreemissing                                      -1.237  0.21599    
## highest.degreeNone                                         -6.513 8.10e-11 ***
## highest.degreePhD                                           3.077  0.00210 ** 
## highest.degreeProfessional degree (DDS, JD, MD)             5.604 2.20e-08 ***
## sexFemale:highest.degreeBachelor's degree (BA, BS)         -1.463  0.14349    
## sexFemale:highest.degreeGED                                 0.043  0.96607    
## sexFemale:highest.degreeHigh school diploma 12 year        -1.766  0.07745 .  
## sexFemale:highest.degreeMaster's degree (MA, MS)           -0.694  0.48768    
## sexFemale:highest.degreemissing                             0.436  0.66313    
## sexFemale:highest.degreeNone                               -1.710  0.08727 .  
## sexFemale:highest.degreePhD                                 0.031  0.97505    
## sexFemale:highest.degreeProfessional degree (DDS, JD, MD)  -0.594  0.55242    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25400 on 4915 degrees of freedom
## Multiple R-squared:  0.2166, Adjusted R-squared:  0.2139 
## F-statistic: 79.93 on 17 and 4915 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + highest.degree
## Model 2: income ~ sex + highest.degree + sex * highest.degree
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   4923 3.1779e+12                            
## 2   4915 3.1707e+12  8 7221855249 1.3993 0.1912
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + job.type
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4931 3.8704e+12                                   
## 2   4899 3.0712e+12 32 7.9921e+11 39.839 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + job.type + sex * job.type, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66273 -16348  -3289  13189 110201 
## 
## Coefficients: (2 not defined because of singularities)
##                                                                           Estimate
## (Intercept)                                                                34896.3
## sexFemale                                                                  -1229.6
## job.typeCLEANING AND BUILDING SERVICE                                      -2675.6
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS                         13718.4
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                           9241.9
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                           14831.0
## job.typeENGINEERING AND RELATED TECHNICIANS                                21633.1
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                               50945.8
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS            10416.2
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS                       -2896.3
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                           33262.2
## job.typeFARMING, FISHING, AND FORESTRY                                     23119.1
## job.typeFOOD PREPARATION                                                  -11696.3
## job.typeFOOD PREPARATIONS AND SERVING RELATED                              -5484.0
## job.typeHEALTH CARE TECHNICAL AND SUPPORT                                   6318.0
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS                        43787.7
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS                      17536.8
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS                         40385.1
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                     26853.7
## job.typeMANAGEMENT RELATED                                                 34201.1
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                               33877.0
## job.typeMEDIA AND COMMUNICATION WORKERS                                    30389.4
## job.typeMILITARY SPECIFIC OCCUPATIONS                                      38334.5
## job.typemissing                                                             9675.6
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                           8030.6
## job.typePERSONAL CARE AND SERVICE WORKERS                                   -396.3
## job.typePHYSICAL SCIENTISTS                                                 9025.9
## job.typePRODUCTION AND OPERATING WORKERS                                    8299.3
## job.typePROTECTIVE SERVICE                                                 29191.9
## job.typeSALES AND RELATED WORKERS                                          18137.8
## job.typeSETTER, OPERATORS, AND TENDERS                                      9097.1
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                              13103.7
## job.typeTEACHERS                                                           12959.6
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS                          4087.4
## sexFemale:job.typeCLEANING AND BUILDING SERVICE                           -15132.6
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS              -23135.0
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                 1022.8
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                -26249.5
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS                      -8299.8
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                      9530.4
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS  -1997.6
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS              -859.4
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                -10118.3
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY                          -32357.2
## sexFemale:job.typeFOOD PREPARATION                                         -4220.4
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED                    -4942.1
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT                       -10346.6
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS             -15535.4
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS           -13036.8
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS              -15927.1
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                NA
## sexFemale:job.typeMANAGEMENT RELATED                                      -11841.4
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                     -3905.7
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS                         -25960.7
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS                                 NA
## sexFemale:job.typemissing                                                  -9008.7
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                -7908.3
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS                        -9921.7
## sexFemale:job.typePHYSICAL SCIENTISTS                                      19140.7
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS                        -11278.4
## sexFemale:job.typePROTECTIVE SERVICE                                      -23290.4
## sexFemale:job.typeSALES AND RELATED WORKERS                               -17005.9
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS                          -12584.6
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                     3377.9
## sexFemale:job.typeTEACHERS                                                 -5202.9
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS              -11405.8
##                                                                           Std. Error
## (Intercept)                                                                   9434.5
## sexFemale                                                                    17224.9
## job.typeCLEANING AND BUILDING SERVICE                                         9864.9
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS                            9575.7
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                            10360.2
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                             12068.6
## job.typeENGINEERING AND RELATED TECHNICIANS                                  11209.8
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                                 10266.7
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS              10548.1
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS                         12918.7
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                              9558.7
## job.typeFARMING, FISHING, AND FORESTRY                                       11702.0
## job.typeFOOD PREPARATION                                                     12301.0
## job.typeFOOD PREPARATIONS AND SERVING RELATED                                 9710.3
## job.typeHEALTH CARE TECHNICAL AND SUPPORT                                    10190.4
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS                          10774.9
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS                         9620.3
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS                           11036.4
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                       15645.3
## job.typeMANAGEMENT RELATED                                                    9730.0
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                                  9727.4
## job.typeMEDIA AND COMMUNICATION WORKERS                                      10894.0
## job.typeMILITARY SPECIFIC OCCUPATIONS                                        11702.0
## job.typemissing                                                               9838.6
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                             9599.8
## job.typePERSONAL CARE AND SERVICE WORKERS                                    10334.9
## job.typePHYSICAL SCIENTISTS                                                  12579.3
## job.typePRODUCTION AND OPERATING WORKERS                                     10141.7
## job.typePROTECTIVE SERVICE                                                    9755.9
## job.typeSALES AND RELATED WORKERS                                             9599.8
## job.typeSETTER, OPERATORS, AND TENDERS                                        9685.1
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                                12579.3
## job.typeTEACHERS                                                              9978.5
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS                            9555.1
## sexFemale:job.typeCLEANING AND BUILDING SERVICE                              17839.9
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS                 19018.1
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                  17987.7
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                   19380.2
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS                        25394.5
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                       20052.5
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS    18365.0
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS               20228.6
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                   17366.1
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY                             20823.9
## sexFemale:job.typeFOOD PREPARATION                                           22688.5
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED                      17512.8
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT                          17742.4
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS                18115.6
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS              20101.8
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS                 18883.3
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                  NA
## sexFemale:job.typeMANAGEMENT RELATED                                         17508.4
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                       17994.3
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS                            18580.8
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS                                   NA
## sexFemale:job.typemissing                                                    17841.8
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                  17353.3
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS                          17876.4
## sexFemale:job.typePHYSICAL SCIENTISTS                                        20441.3
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS                           18315.7
## sexFemale:job.typePROTECTIVE SERVICE                                         17867.9
## sexFemale:job.typeSALES AND RELATED WORKERS                                  17402.3
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS                             17733.3
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                      20441.3
## sexFemale:job.typeTEACHERS                                                   17618.8
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS                 17662.6
##                                                                           t value
## (Intercept)                                                                 3.699
## sexFemale                                                                  -0.071
## job.typeCLEANING AND BUILDING SERVICE                                      -0.271
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS                          1.433
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                           0.892
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                            1.229
## job.typeENGINEERING AND RELATED TECHNICIANS                                 1.930
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                                4.962
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS             0.988
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS                       -0.224
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                            3.480
## job.typeFARMING, FISHING, AND FORESTRY                                      1.976
## job.typeFOOD PREPARATION                                                   -0.951
## job.typeFOOD PREPARATIONS AND SERVING RELATED                              -0.565
## job.typeHEALTH CARE TECHNICAL AND SUPPORT                                   0.620
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS                         4.064
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS                       1.823
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS                          3.659
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                      1.716
## job.typeMANAGEMENT RELATED                                                  3.515
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                                3.483
## job.typeMEDIA AND COMMUNICATION WORKERS                                     2.790
## job.typeMILITARY SPECIFIC OCCUPATIONS                                       3.276
## job.typemissing                                                             0.983
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                           0.837
## job.typePERSONAL CARE AND SERVICE WORKERS                                  -0.038
## job.typePHYSICAL SCIENTISTS                                                 0.718
## job.typePRODUCTION AND OPERATING WORKERS                                    0.818
## job.typePROTECTIVE SERVICE                                                  2.992
## job.typeSALES AND RELATED WORKERS                                           1.889
## job.typeSETTER, OPERATORS, AND TENDERS                                      0.939
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                               1.042
## job.typeTEACHERS                                                            1.299
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS                          0.428
## sexFemale:job.typeCLEANING AND BUILDING SERVICE                            -0.848
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS               -1.216
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                 0.057
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                 -1.354
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS                      -0.327
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                      0.475
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS  -0.109
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS             -0.042
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                 -0.583
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY                           -1.554
## sexFemale:job.typeFOOD PREPARATION                                         -0.186
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED                    -0.282
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT                        -0.583
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS              -0.858
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS            -0.649
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS               -0.843
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS               NA
## sexFemale:job.typeMANAGEMENT RELATED                                       -0.676
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                     -0.217
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS                          -1.397
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS                                NA
## sexFemale:job.typemissing                                                  -0.505
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                -0.456
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS                        -0.555
## sexFemale:job.typePHYSICAL SCIENTISTS                                       0.936
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS                         -0.616
## sexFemale:job.typePROTECTIVE SERVICE                                       -1.303
## sexFemale:job.typeSALES AND RELATED WORKERS                                -0.977
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS                           -0.710
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                     0.165
## sexFemale:job.typeTEACHERS                                                 -0.295
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS               -0.646
##                                                                           Pr(>|t|)
## (Intercept)                                                               0.000219
## sexFemale                                                                 0.943093
## job.typeCLEANING AND BUILDING SERVICE                                     0.786229
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS                        0.152032
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                         0.372405
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                          0.219173
## job.typeENGINEERING AND RELATED TECHNICIANS                               0.053685
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                              7.21e-07
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS           0.323446
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS                      0.822616
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                          0.000506
## job.typeFARMING, FISHING, AND FORESTRY                                    0.048251
## job.typeFOOD PREPARATION                                                  0.341734
## job.typeFOOD PREPARATIONS AND SERVING RELATED                             0.572261
## job.typeHEALTH CARE TECHNICAL AND SUPPORT                                 0.535290
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS                       4.90e-05
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS                     0.068379
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS                        0.000256
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                    0.086151
## job.typeMANAGEMENT RELATED                                                0.000444
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                              0.000501
## job.typeMEDIA AND COMMUNICATION WORKERS                                   0.005298
## job.typeMILITARY SPECIFIC OCCUPATIONS                                     0.001061
## job.typemissing                                                           0.325443
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                         0.402895
## job.typePERSONAL CARE AND SERVICE WORKERS                                 0.969415
## job.typePHYSICAL SCIENTISTS                                               0.473086
## job.typePRODUCTION AND OPERATING WORKERS                                  0.413210
## job.typePROTECTIVE SERVICE                                                0.002783
## job.typeSALES AND RELATED WORKERS                                         0.058898
## job.typeSETTER, OPERATORS, AND TENDERS                                    0.347633
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                             0.297607
## job.typeTEACHERS                                                          0.194087
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS                        0.668834
## sexFemale:job.typeCLEANING AND BUILDING SERVICE                           0.396344
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS              0.223864
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS               0.954656
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                0.175655
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS                     0.743807
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                    0.634615
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS 0.913389
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS            0.966115
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                0.560158
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY                          0.120285
## sexFemale:job.typeFOOD PREPARATION                                        0.852442
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED                   0.777802
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT                       0.559817
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS             0.391172
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS           0.516666
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS              0.399018
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                NA
## sexFemale:job.typeMANAGEMENT RELATED                                      0.498865
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                    0.828177
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS                         0.162424
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS                                 NA
## sexFemale:job.typemissing                                                 0.613640
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS               0.648607
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS                       0.578906
## sexFemale:job.typePHYSICAL SCIENTISTS                                     0.349128
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS                        0.538071
## sexFemale:job.typePROTECTIVE SERVICE                                      0.192474
## sexFemale:job.typeSALES AND RELATED WORKERS                               0.328508
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS                          0.477951
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                   0.868757
## sexFemale:job.typeTEACHERS                                                0.767775
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS              0.518468
##                                                                              
## (Intercept)                                                               ***
## sexFemale                                                                    
## job.typeCLEANING AND BUILDING SERVICE                                        
## job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS                           
## job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                            
## job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                             
## job.typeENGINEERING AND RELATED TECHNICIANS                               .  
## job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                              ***
## job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS              
## job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS                         
## job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                          ***
## job.typeFARMING, FISHING, AND FORESTRY                                    *  
## job.typeFOOD PREPARATION                                                     
## job.typeFOOD PREPARATIONS AND SERVING RELATED                                
## job.typeHEALTH CARE TECHNICAL AND SUPPORT                                    
## job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS                       ***
## job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS                     .  
## job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS                        ***
## job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS                    .  
## job.typeMANAGEMENT RELATED                                                ***
## job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                              ***
## job.typeMEDIA AND COMMUNICATION WORKERS                                   ** 
## job.typeMILITARY SPECIFIC OCCUPATIONS                                     ** 
## job.typemissing                                                              
## job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                            
## job.typePERSONAL CARE AND SERVICE WORKERS                                    
## job.typePHYSICAL SCIENTISTS                                                  
## job.typePRODUCTION AND OPERATING WORKERS                                     
## job.typePROTECTIVE SERVICE                                                ** 
## job.typeSALES AND RELATED WORKERS                                         .  
## job.typeSETTER, OPERATORS, AND TENDERS                                       
## job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                                
## job.typeTEACHERS                                                             
## job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS                           
## sexFemale:job.typeCLEANING AND BUILDING SERVICE                              
## sexFemale:job.typeCONSTRUCTION TRADES AND EXTRACTION WORKERS                 
## sexFemale:job.typeCOUNSELORS, SOCIAL, AND RELIGIOUS WORKERS                  
## sexFemale:job.typeEDUCATION, TRAINING, AND LIBRARY WORKERS                   
## sexFemale:job.typeENGINEERING AND RELATED TECHNICIANS                        
## sexFemale:job.typeENGINEERS, ARCHITECTS, AND SURVEYORS                       
## sexFemale:job.typeENTERTAINERS AND PERFORMERS, SPORTS AND RELATED WORKERS    
## sexFemale:job.typeENTERTAINMENT ATTENDANTS AND RELATED WORKERS               
## sexFemale:job.typeEXECUTIVE, ADMINISTRATIVE AND MANAGERIAL                   
## sexFemale:job.typeFARMING, FISHING, AND FORESTRY                             
## sexFemale:job.typeFOOD PREPARATION                                           
## sexFemale:job.typeFOOD PREPARATIONS AND SERVING RELATED                      
## sexFemale:job.typeHEALTH CARE TECHNICAL AND SUPPORT                          
## sexFemale:job.typeHEALTH DIAGNOSIS AND TREATING PRACTITIONERS                
## sexFemale:job.typeINSTALLATION, MAINTENANCE, AND REPAIR WORKERS              
## sexFemale:job.typeLAWYERS, JUDGES, AND LEGAL SUPPORT WORKERS                 
## sexFemale:job.typeLIFE, PHYSICAL, AND SOCIAL SCIENCE TECHNICIANS             
## sexFemale:job.typeMANAGEMENT RELATED                                         
## sexFemale:job.typeMATHEMATICAL AND COMPUTER SCIENTISTS                       
## sexFemale:job.typeMEDIA AND COMMUNICATION WORKERS                            
## sexFemale:job.typeMILITARY SPECIFIC OCCUPATIONS                              
## sexFemale:job.typemissing                                                    
## sexFemale:job.typeOFFICE AND ADMINISTRATIVE SUPPORT WORKERS                  
## sexFemale:job.typePERSONAL CARE AND SERVICE WORKERS                          
## sexFemale:job.typePHYSICAL SCIENTISTS                                        
## sexFemale:job.typePRODUCTION AND OPERATING WORKERS                           
## sexFemale:job.typePROTECTIVE SERVICE                                         
## sexFemale:job.typeSALES AND RELATED WORKERS                                  
## sexFemale:job.typeSETTER, OPERATORS, AND TENDERS                             
## sexFemale:job.typeSOCIAL SCIENTISTS AND RELATED WORKERS                      
## sexFemale:job.typeTEACHERS                                                   
## sexFemale:job.typeTRANSPORTATION AND MATERIAL MOVING WORKERS                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24960 on 4869 degrees of freedom
## Multiple R-squared:  0.2504, Adjusted R-squared:  0.2407 
## F-statistic: 25.82 on 63 and 4869 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + job.type
## Model 2: income ~ sex + job.type + sex * job.type
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4899 3.0712e+12                                   
## 2   4869 3.0337e+12 30 3.7488e+10 2.0056 0.0009284 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1391  0.0000 11.0000
## 
## Call:
## lm(formula = income ~ sex, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51135 -20137  -4659  13863 105841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51136.7      559.2   91.44   <2e-16 ***
## sexFemale   -11977.6      797.9  -15.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28020 on 4931 degrees of freedom
## Multiple R-squared:  0.0437, Adjusted R-squared:  0.04351 
## F-statistic: 225.3 on 1 and 4931 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = income ~ sex + total.incarcerations, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52705 -19484  -4855  14516 105516 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           52855.3      575.4   91.85   <2e-16 ***
## sexFemale            -13370.9      799.3  -16.73   <2e-16 ***
## total.incarcerations  -7437.5      691.2  -10.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27700 on 4930 degrees of freedom
## Multiple R-squared:  0.06564,    Adjusted R-squared:  0.06526 
## F-statistic: 173.2 on 2 and 4930 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex
## Model 2: income ~ sex + total.incarcerations
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   4931 3.8704e+12                                   
## 2   4930 3.7816e+12  1 8.8801e+10 115.77 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = income ~ sex + total.incarcerations + sex * total.incarcerations, 
##     data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52678 -19517  -4517  14483 105483 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     52828.4      579.0  91.245   <2e-16 ***
## sexFemale                      -13311.8      811.5 -16.404   <2e-16 ***
## total.incarcerations            -7321.0      744.0  -9.840   <2e-16 ***
## sexFemale:total.incarcerations   -852.1     2012.7  -0.423    0.672    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27700 on 4929 degrees of freedom
## Multiple R-squared:  0.06568,    Adjusted R-squared:  0.06511 
## F-statistic: 115.5 on 3 and 4929 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: income ~ sex + total.incarcerations
## Model 2: income ~ sex + total.incarcerations + sex * total.incarcerations
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1   4930 3.7816e+12                           
## 2   4929 3.7815e+12  1 137502914 0.1792 0.6721

Interaction Terms

a) Checking for statistically significant interactions with variable sex

We tested the interaction of each variable shortlisted for the final linear model with the variable sex. Only 3 interaction terms were statistically significant: 1) sex*race 2) sex*marital.status 3) sex*highest.degree
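The check described above can be sketched as a nested-model comparison for each candidate variable, mirroring the anova() tables in the output. The data below are simulated, and the column names x1 and grp and the helper interaction_p are hypothetical stand-ins rather than the shortlisted NLSY97 variables or the code used for this report.

```r
# Simulated stand-in data; `x1` and `grp` are hypothetical predictors.
set.seed(1)
n <- 2000
toy <- data.frame(
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE),
               levels = c("Male", "Female")),
  x1  = rnorm(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
# Build income so that the sex gap depends on `grp` but not on `x1`.
toy$income <- 50000 - 12000 * (toy$sex == "Female") + 3000 * toy$x1 -
  8000 * (toy$sex == "Female") * (toy$grp == "c") + rnorm(n, sd = 10000)

# For one candidate, compare the additive model with the interaction model.
interaction_p <- function(var, data) {
  additive    <- lm(reformulate(c("sex", var), response = "income"),
                    data = data)
  interacting <- lm(reformulate(c("sex", var, paste0("sex:", var)),
                                response = "income"), data = data)
  # The second row of the anova table holds the F test for the added terms.
  anova(additive, interacting)[2, "Pr(>F)"]
}

p_values <- sapply(c("x1", "grp"), interaction_p, data = toy)
print(round(p_values, 4))
```

A small p-value for a candidate suggests that the income gap between the sexes differs across that variable's values, which is the criterion used to shortlist interactions here.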

b) Shortlisting interactions with variable sex to be added to the linear model

Below we discuss the 3 statistically significant interactions with the variable sex and decide whether each should be included in the final linear model:

  1. sex*race: Sex and race appear to have a compounding effect on income. The sex variable has a statistically significant impact on income, as shown by the income gap between men and women. This gap can be further affected by racial group, owing to differences in the social constraints placed on each sex within different racial groups. For example, on average brown women may be affected considerably more by domestic and household pressures than white women: their social set-up may require them to focus on domestic chores and child-rearing while giving brown men greater freedom to pursue professional careers, resulting in a wider income gap. Therefore, it makes sense to include the statistically significant sex and race interaction term in the linear model to check its impact on the income gap.

  2. sex*highest.degree: Highest degree is an important variable defining education in our linear regression model. Together with sex, it can have a compounded effect on income. For example, a sex/gender already disadvantaged in terms of income may be further disadvantaged by lower education levels. For similar qualifications, women tend to earn less than men in the US, which points to a compounded effect of sex and education on income. One study indicates that the income gap widens with a college degree, that is, the income gap between men and women with a college degree is wider than the gap between men and women without one, which is further evidence for an interaction term in our regression model (Day, 2019). Including the statistically significant sex and highest.degree interaction term in the linear model would therefore be reasonable. However, although this interaction is interesting, we have chosen to leave it out to limit the complexity of the model.

Sources: Day, J. C. (2019, May 29). College Degree Widens Gender Earnings Gap - U.S. Census Bureau. Retrieved November 13, 2020, from https://www.census.gov/library/stories/2019/05/college-degree-widens-gender-earnings-gap.html

  3. sex*marital.status: Some levels of marital.status were statistically insignificant, raising the concern that the sex and marital.status interaction term could misrepresent the findings. So it does not make sense to include the sex and marital.status interaction term in the linear regression model, even though it is statistically significant.

5. Discussion

##  [1] "total.incarcerations"  "marijuana"             "sex"                  
##  [4] "marital.status"        "high.school.diploma"   "parenthood.by.20"     
##  [7] "chance.college.degree" "hard.times"            "work.limitation"      
## [10] "citizenship"           "mother.edu"            "father.edu"           
## [13] "race"                  "hard.drugs"            "public.private"       
## [16] "highest.degree"        "emp_industry_2011"     "job.type"             
## [19] "income"
## 
## Call:
## lm(formula = income ~ sex + marital.status, data = renamed_nlsy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56702 -19491  -4914  14555 107624 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    47478       1300  36.513  < 2e-16 ***
## sexFemale                     -11988        784 -15.290  < 2e-16 ***
## marital.statusMarried           9423       1346   7.003 2.84e-12 ***
## marital.statusmissing          -4827       5162  -0.935    0.350    
## marital.statusNever married    -2034       1377  -1.477    0.140    
## marital.statusSeparated        -3115       2957  -1.053    0.292    
## marital.statusWidowed          -5900       6774  -0.871    0.384    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27460 on 4926 degrees of freedom
## Multiple R-squared:  0.08219,    Adjusted R-squared:  0.08108 
## F-statistic: 73.52 on 6 and 4926 DF,  p-value: < 2.2e-16
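  1. The coefficient of sexFemale indicates that females earn $12,676 less than males, the baseline level of sex in our dataset. Because the model includes a sex*race interaction, this estimate applies to the baseline race (Black).

  2. The coefficient for raceHispanic indicates that Hispanic individuals earn on average $7,154 more than Black individuals. Because of the sex*race interaction, this estimate applies to males, the baseline sex.

  3. The coefficient for raceNon-Black/ Non-Hispanic indicates that individuals in this group earn on average $10,422 more than Black individuals, again for males, the baseline sex.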

  4. The coefficient for total.incarcerations indicates that each additional incarceration is associated with a $3,212 fall in income.

  5. The coefficient for marital.statusMarried shows that individuals who are married earn $3,252 more than individuals who are divorced.

  6. The coefficient for marital.statusNever_married shows that individuals who have never married earn $3,156 less than individuals who are divorced.

  7. The coefficient for the UTILITIES level of emp_industry_2011 (which describes the employment industry) is statistically significant. On average, income in this industry is $20,796 higher than in the baseline level, ACS SPECIAL CODE.

  8. The coefficient for the MINING level of emp_industry_2011 is statistically significant. On average, income in this industry is $25,441 higher than in the baseline level, ACS SPECIAL CODE.

  9. The coefficient for the Professional Degree level of highest.degree (which describes the highest degree a person has attained) is statistically significant. On average, income is $37,439 higher than for the baseline level, Associate Junior College.

  10. The PhD level of highest.degree is also very significant, with a p-value of 4.29e-08. On average, income is $39,296 higher than for the baseline level, Associate Junior College.

  11. The None level of highest.degree is also very significant, with a p-value of less than 2e-16. On average, income is $18,042 lower than for the baseline level, Associate Junior College.

  12. The Master’s degree MA/MS level of highest.degree is also very significant, with a p-value of less than 2e-16. On average, income is $18,418 higher than for the baseline level, Associate Junior College.

  13. The High school diploma level of highest.degree is also very significant, with a p-value of 8.61e-09. On average, income is $7,967 lower than for the baseline level, Associate Junior College.

  14. The GED level of highest.degree is also very significant, with a p-value of 8.01e-15. On average, income is $13,179 lower than for the baseline level, Associate Junior College.

  15. The Bachelor’s degree level of highest.degree is also very significant, with a p-value of 1.69e-10. On average, income is $9,436 higher than for the baseline level, Associate Junior College.

  16. The sex and race interaction term gives an interesting perspective on the income gap between men and women. The sexFemale:raceHispanic and sexFemale:raceNon-Black/Non-Hispanic terms have coefficients of -7575.78 and -8128.30, which are statistically significant with p-value = 0.00025 and p-value = 1.88e-06, respectively. Because both coefficients are negative, they add to the shortfall for women captured by sexFemale: the sexFemale:raceHispanic term suggests that the income gap between men and women among Hispanic individuals is on average $7,575.78 wider than the gap for our baseline race, Black, and the sexFemale:raceNon-Black/Non-Hispanic term suggests that the gap among Non-Black / Non-Hispanic individuals is on average $8,128.30 wider than the gap for the baseline race.
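As a worked check on how the interaction coefficients combine with the main effect, the snippet below adds up the coefficients quoted in the list above. It assumes that the sexFemale estimate in item 1 and the interaction estimates in item 16 come from the same fitted model; the variable names are illustrative only.

```r
# Coefficients as quoted in the discussion (assumed to come from one fit).
b_female          <- -12676     # sexFemale: female minus male, baseline race
b_female_hispanic <- -7575.78   # sexFemale:raceHispanic
b_female_nbnh     <- -8128.30   # sexFemale:raceNon-Black/Non-Hispanic

# Female-minus-male income difference implied for each race.
gap_by_race <- c(
  Black                    = b_female,
  Hispanic                 = b_female + b_female_hispanic,
  "Non-Black/Non-Hispanic" = b_female + b_female_nbnh
)
print(gap_by_race)
```

A more negative value means a larger shortfall for women relative to men in that group.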

All other variables were found to be statistically insignificant at the 0.05 significance level.

a. Analyzing Diagnostic Plots

lm.final <- lm(income ~ sex + race +  sex*race + highest.degree + 
                 total.incarcerations + marital.status + hard.times +
                 citizenship + emp_industry_2011 + job.type, data=renamed_nlsy)

plot(lm.final)

  1. Residuals vs Fitted: For every observation in the data, the linear regression model predicts a fitted value, which is the estimate of income from the fitted model. The residual is the difference between the actual observed value of income and the fitted value. We use this plot to assess non-linearity and to check whether the constant variance assumption is violated. The red line, which tracks the average residual, stays close to 0, suggesting that no strong non-linear pattern is left in the residuals.

  2. Normal Q-Q plot: The residuals follow the line well except in the upper tail, where there is a clear deviation from normality. This may be a result of having a majority of categorical variables in our data, which causes our income to take only select values.

  3. Residuals vs Leverage: A problematic outlier for us would be one with both a high residual and high leverage. Such a point lies far from what the line predicts and has a high influence on the linear model that we fit. However, since no points lie outside the Cook’s distance contours, there are no concerning outliers.

  4. Scale-Location: This plot provides a clearer picture of non-constant variance. Ideally we should have a flat line centered at 1, i.e. the ability of the model to predict income does not vary across fitted values. Here, however, the upward-sloping line tells us that non-constant variance is present in the model: the error is not constant across observations and grows as the predicted values get higher, which may call our inferences into question.
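The visual checks above can be complemented with a few numbers. The sketch below uses simulated data (not the NLSY97 sample) whose error spread grows with the predictor, then extracts the quantities behind the Residuals vs Leverage and Scale-Location panels with base R functions. The 0.5 cut-off for Cook's distance is a common rule of thumb assumed here, not a value taken from this report.

```r
set.seed(42)
n <- 1000
x <- runif(n, 1, 10)
# The error standard deviation grows with x, so the variance is not constant.
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Residuals vs Leverage: largest Cook's distance and count above the cut-off.
cd <- cooks.distance(fit)
max_cook <- max(cd)
n_influential <- sum(cd > 0.5)

# Scale-Location: sqrt(|standardized residuals|) against fitted values.
scale_loc <- sqrt(abs(rstandard(fit)))
trend <- cor(fitted(fit), scale_loc)  # positive when spread grows with fit

cat("max Cook's distance: ", round(max_cook, 4), "\n")
cat("points above 0.5:    ", n_influential, "\n")
cat("scale-location trend:", round(trend, 3), "\n")

# plot(fit, which = c(3, 5)) would draw the corresponding panels.
```

Applied to lm.final, a clearly positive trend would be the numeric counterpart of the upward-sloping Scale-Location line.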